arkouda

Submodules

Attributes

Exceptions

NonUniqueError

Inappropriate argument value (of correct type).

RegistrationError

Error/Exception used when the Arkouda Server cannot register an object

RegistrationError

Error/Exception used when the Arkouda Server cannot register an object

RegistrationError

Error/Exception used when the Arkouda Server cannot register an object

RegistrationError

Error/Exception used when the Arkouda Server cannot register an object

RegistrationError

Error/Exception used when the Arkouda Server cannot register an object

Classes

ARKOUDA_SUPPORTED_DTYPES

frozenset() -> empty frozenset object

ARKOUDA_SUPPORTED_INTS

Built-in immutable sequence.

BitVector

Represent integers as bit vectors, e.g. a set of flags.

BoolDType

DType class corresponding to the scalar type and dtype of the same name.

ByteDType

DType class corresponding to the scalar type and dtype of the same name.

BytesDType

DType class corresponding to the scalar type and dtype of the same name.

CLongDoubleDType

DType class corresponding to the scalar type and dtype of the same name.

CachedAccessor

Custom property-like object.

Categorical

Represents an array of values belonging to named categories. Converting a

Categorical

Represents an array of values belonging to named categories. Converting a

Categorical

Represents an array of values belonging to named categories. Converting a

Complex128DType

DType class corresponding to the scalar type and dtype of the same name.

Complex64DType

DType class corresponding to the scalar type and dtype of the same name.

DType

An enumeration.

DTypeObjects

frozenset() -> empty frozenset object

DTypes

frozenset() -> empty frozenset object

DataFrame

A DataFrame structure based on arkouda arrays.

DataFrame

A DataFrame structure based on arkouda arrays.

DataFrameGroupBy

A DataFrame that has been grouped by a subset of columns.

DataSource

DataSource(destpath='.')

DateTime64DType

DType class corresponding to the scalar type and dtype of the same name.

Datetime

Represents a date and/or time.

Datetime

Represents a date and/or time.

Datetime

Represents a date and/or time.

DatetimeAccessor

DiffAggregate

A column in a GroupBy that has been differenced.

ErrorMode

An enumeration.

False_

Boolean type (True or False), stored as a byte.

Fields

An integer-backed representation of a set of named binary fields, e.g. flags.

Float16DType

DType class corresponding to the scalar type and dtype of the same name.

Float32DType

DType class corresponding to the scalar type and dtype of the same name.

Float64DType

DType class corresponding to the scalar type and dtype of the same name.

GROUPBY_REDUCTION_TYPES

frozenset() -> empty frozenset object

GroupBy

Group an array or list of arrays by value, usually in preparation

GroupBy

Group an array or list of arrays by value, usually in preparation

GroupBy

Group an array or list of arrays by value, usually in preparation

GroupBy

Group an array or list of arrays by value, usually in preparation

GroupBy

Group an array or list of arrays by value, usually in preparation

GroupBy

Group an array or list of arrays by value, usually in preparation

IPv4

Represent integers as IPv4 addresses.

Index

Int16DType

DType class corresponding to the scalar type and dtype of the same name.

Int32DType

DType class corresponding to the scalar type and dtype of the same name.

Int64DType

DType class corresponding to the scalar type and dtype of the same name.

Int8DType

DType class corresponding to the scalar type and dtype of the same name.

IntDType

DType class corresponding to the scalar type and dtype of the same name.

LogLevel

Generic enumeration.

LongDType

DType class corresponding to the scalar type and dtype of the same name.

LongDoubleDType

DType class corresponding to the scalar type and dtype of the same name.

LongLongDType

DType class corresponding to the scalar type and dtype of the same name.

MultiIndex

NUMBER_FORMAT_STRINGS

dict() -> new empty dictionary

NumericDTypes

frozenset() -> empty frozenset object

ObjectDType

DType class corresponding to the scalar type and dtype of the same name.

Power_divergenceResult

The results of a power divergence statistical test.

Properties

RankWarning

Issued by polyfit when the Vandermonde matrix is rank deficient.

Row

This class is useful for printing and working with individual rows of a

ScalarDTypes

frozenset() -> empty frozenset object

ScalarType

Built-in immutable sequence.

SegArray

Series

One-dimensional arkouda array with axis labels.

SeriesDTypes

dict() -> new empty dictionary

ShortDType

DType class corresponding to the scalar type and dtype of the same name.

StrDType

DType class corresponding to the scalar type and dtype of the same name.

StringAccessor

Strings

Represents an array of strings whose data resides on the

Strings

Represents an array of strings whose data resides on the

Strings

Represents an array of strings whose data resides on the

Strings

Represents an array of strings whose data resides on the

Strings

Represents an array of strings whose data resides on the

TimeDelta64DType

DType class corresponding to the scalar type and dtype of the same name.

Timedelta

Represents a duration, the difference between two dates or times.

Timedelta

Represents a duration, the difference between two dates or times.

TooHardError

max_work was exceeded.

True_

Boolean type (True or False), stored as a byte.

UByteDType

DType class corresponding to the scalar type and dtype of the same name.

UInt16DType

DType class corresponding to the scalar type and dtype of the same name.

UInt32DType

DType class corresponding to the scalar type and dtype of the same name.

UInt64DType

DType class corresponding to the scalar type and dtype of the same name.

UInt8DType

DType class corresponding to the scalar type and dtype of the same name.

UIntDType

DType class corresponding to the scalar type and dtype of the same name.

ULongDType

DType class corresponding to the scalar type and dtype of the same name.

ULongLongDType

DType class corresponding to the scalar type and dtype of the same name.

UShortDType

DType class corresponding to the scalar type and dtype of the same name.

VoidDType

DType class corresponding to the scalar type and dtype of the same name.

akbool

Boolean type (True or False), stored as a byte.

akbool

Boolean type (True or False), stored as a byte.

akfloat64

Double-precision floating-point number type, compatible with Python float

akfloat64

Double-precision floating-point number type, compatible with Python float

akint64

Signed integer type, compatible with Python int and C long.

akint64

Signed integer type, compatible with Python int and C long.

akint64

Signed integer type, compatible with Python int and C long.

akuint64

Unsigned integer type, compatible with C unsigned long.

akuint64

Unsigned integer type, compatible with C unsigned long.

akuint64

Unsigned integer type, compatible with C unsigned long.

all_scalars

The central part of internal API.

bigint

bigint

bitType

Unsigned integer type, compatible with C unsigned long.

bitType

Unsigned integer type, compatible with C unsigned long.

bool_

Boolean type (True or False), stored as a byte.

bool_scalars

The central part of internal API.

bool_scalars

The central part of internal API.

byte

Signed integer type, compatible with C char.

bytes_

A byte string.

cdouble

Complex number type composed of two double-precision floating-point

cfloat

Complex number type composed of two double-precision floating-point

character

Abstract base class of all character string scalar types.

clongdouble

Complex number type composed of two extended-precision floating-point

clongfloat

Complex number type composed of two extended-precision floating-point

complex128

Complex number type composed of two double-precision floating-point

complex64

Complex number type composed of two single-precision floating-point

csingle

Complex number type composed of two single-precision floating-point

datetime64

If created from a 64-bit integer, it represents an offset from

double

Double-precision floating-point number type, compatible with Python float

finfo

finfo(dtype)

flexible

Abstract base class of all scalar types without predefined length.

float16

Half-precision floating-point number type.

float32

Single-precision floating-point number type, compatible with C float.

float64

Double-precision floating-point number type, compatible with Python float

float_

Double-precision floating-point number type, compatible with Python float

float_scalars

The central part of internal API.

floating

Abstract base class of all floating-point scalar types.

format_parser

Class to convert formats, names, titles description to a dtype.

half

Half-precision floating-point number type.

iinfo

iinfo(type)

inexact

Abstract base class of all numeric scalar types with a (potentially)

int16

Signed integer type, compatible with C short.

int32

Signed integer type, compatible with C int.

int64

Signed integer type, compatible with Python int and C long.

int64

Signed integer type, compatible with Python int and C long.

int8

Signed integer type, compatible with C char.

intTypes

frozenset() -> empty frozenset object

intTypes

frozenset() -> empty frozenset object

intTypes

frozenset() -> empty frozenset object

int_

Signed integer type, compatible with Python int and C long.

int_scalars

The central part of internal API.

int_scalars

The central part of internal API.

int_scalars

The central part of internal API.

intc

Signed integer type, compatible with C int.

integer

Abstract base class of all integer scalar types.

intp

Signed integer type, compatible with Python int and C long.

longdouble

Extended-precision floating-point number type, compatible with C

longfloat

Extended-precision floating-point number type, compatible with C

longlong

Signed integer type, compatible with C long long.

number

Abstract base class of all numeric scalar types.

numeric_and_bool_scalars

The central part of internal API.

numeric_scalars

The central part of internal API.

numpy_scalars

The central part of internal API.

object_

Any Python object.

pdarray

The basic arkouda array class. This class contains only the

pdarray

The basic arkouda array class. This class contains only the

pdarray

The basic arkouda array class. This class contains only the

pdarray

The basic arkouda array class. This class contains only the

pdarray

The basic arkouda array class. This class contains only the

pdarray

The basic arkouda array class. This class contains only the

sctypeDict

dict() -> new empty dictionary

sctypes

dict() -> new empty dictionary

short

Signed integer type, compatible with C short.

signedinteger

Abstract base class of all signed integer scalar types.

single

Single-precision floating-point number type, compatible with C float.

sparray

The class for sparse arrays. This class contains only the

str_

A unicode string.

str_

A unicode string.

str_scalars

The central part of internal API.

timedelta64

A timedelta stored as a 64-bit integer.

ubyte

Unsigned integer type, compatible with C unsigned char.

uint

Unsigned integer type, compatible with C unsigned long.

uint16

Unsigned integer type, compatible with C unsigned short.

uint32

Unsigned integer type, compatible with C unsigned int.

uint64

Unsigned integer type, compatible with C unsigned long.

uint8

Unsigned integer type, compatible with C unsigned char.

uintc

Unsigned integer type, compatible with C unsigned int.

uintp

Unsigned integer type, compatible with C unsigned long.

ulonglong

Signed integer type, compatible with C unsigned long long.

unsignedinteger

Abstract base class of all unsigned integer scalar types.

ushort

Unsigned integer type, compatible with C unsigned short.

void

np.void(length_or_data, /, dtype=None)

Functions

BitVectorizer([width, reverse])

Make a callback (i.e. function) that can be called on an

abs(→ arkouda.pdarrayclass.pdarray)

Return the element-wise absolute value of the array.

add_newdoc(place, obj, doc[, warn_on_python])

Add documentation to an existing object, typically one defined in C

akabs(→ arkouda.pdarrayclass.pdarray)

Return the element-wise absolute value of the array.

akcast(→ Union[arkouda.pdarrayclass.pdarray, ...)

Cast an array to another dtype.

akcast(→ Union[arkouda.pdarrayclass.pdarray, ...)

Cast an array to another dtype.

align(*args)

Map multiple arrays of sparse identifiers to a common 0-up index.

arccos(→ arkouda.pdarrayclass.pdarray)

Return the element-wise inverse cosine of the array. The result is between 0 and pi.

arccosh(→ arkouda.pdarrayclass.pdarray)

Return the element-wise inverse hyperbolic cosine of the array.

arcsin(→ arkouda.pdarrayclass.pdarray)

Return the element-wise inverse sine of the array. The result is between -pi/2 and pi/2.

arcsinh(→ arkouda.pdarrayclass.pdarray)

Return the element-wise inverse hyperbolic sine of the array.

arctan(→ arkouda.pdarrayclass.pdarray)

Return the element-wise inverse tangent of the array. The result is between -pi/2 and pi/2.

arctan2(→ arkouda.pdarrayclass.pdarray)

Return the element-wise inverse tangent of the array pair. The result chosen is the

arctanh(→ arkouda.pdarrayclass.pdarray)

Return the element-wise inverse hyperbolic tangent of the array.

argmaxk(→ pdarray)

Find the indices corresponding to the k maximum values of an array.

argmink(→ pdarray)

Finds the indices corresponding to the k minimum values of an array.

argsort(→ arkouda.pdarrayclass.pdarray)

Return the permutation that sorts the array.

argsort(→ arkouda.pdarrayclass.pdarray)

Return the permutation that sorts the array.

argsort(→ arkouda.pdarrayclass.pdarray)

Return the permutation that sorts the array.

array_equal(pda_a, pda_b[, equal_nan])

Compares two pdarrays for equality.

assert_almost_equal(→ None)

Check that the left and right objects are approximately equal.

assert_almost_equivalent(→ None)

Check that the left and right objects are approximately equal.

assert_arkouda_array_equal(→ None)

Check that 'ak.pdarray' or 'ak.Strings', 'ak.Categorical', or 'ak.SegArray' is equivalent.

assert_arkouda_array_equivalent(→ None)

Check that 'np.array', 'pd.Categorical', 'ak.pdarray', 'ak.Strings',

assert_arkouda_pdarray_equal(→ None)

Check that the two 'ak.pdarray's are equivalent.

assert_arkouda_segarray_equal(→ None)

Check that the two 'ak.segarray's are equivalent.

assert_arkouda_strings_equal(→ None)

Check that 'ak.Strings' is equivalent.

assert_attr_equal(→ None)

Check attributes are equal. Both objects must have attribute.

assert_categorical_equal(→ None)

Test that Categoricals are equivalent.

assert_class_equal(→ None)

Checks classes are equal.

assert_contains_all(→ None)

Assert that a dictionary contains all the elements of an iterable.

assert_copy(→ None)

Checks that the elements are equal, but not the same object.

assert_dict_equal(→ None)

Assert that two dictionaries are equal.

assert_equal(→ None)

Wrapper for tm.assert_*_equal to dispatch to the appropriate test function.

assert_equivalent(→ None)

Wrapper for tm.assert_*_equivalent to dispatch to the appropriate test function.

assert_frame_equal(→ None)

Check that left and right DataFrame are equal.

assert_frame_equivalent(→ None)

Check that left and right DataFrame are equal.

assert_index_equal(→ None)

Check that left and right Index are equal.

assert_index_equivalent(→ None)

Check that left and right Index are equal.

assert_is_sorted(→ None)

Assert that the sequence is sorted.

assert_series_equal(→ None)

Check that left and right Series are equal.

assert_series_equivalent(→ None)

Check that left and right Series are equal.

attach(name)

attach_all(names)

Attach to all objects registered with the names provide

attach_pdarray(→ pdarray)

class method to return a pdarray attached to the registered name in the arkouda

base_repr(number[, base, padding])

Return a string representation of a number in the given base system.

binary_repr(num[, width])

Return the binary representation of the input number as a string.

broadcast(segments, values[, size, permutation])

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

broadcast(segments, values[, size, permutation])

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

broadcast(segments, values[, size, permutation])

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

broadcast(segments, values[, size, permutation])

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

broadcast_dims(→ Tuple[int, Ellipsis])

Algorithm to determine shape of broadcasted PD array given two array shapes

broadcast_to_shape(→ pdarray)

expand an array's rank to the specified shape using broadcasting

cast(→ Union[arkouda.pdarrayclass.pdarray, ...)

Cast an array to another dtype.

cast(→ Union[arkouda.pdarrayclass.pdarray, ...)

Cast an array to another dtype.

ceil(→ arkouda.pdarrayclass.pdarray)

Return the element-wise ceiling of the array.

chisquare(f_obs[, f_exp, ddof])

Computes the chi square statistic and p-value.

clear(→ None)

Send a clear message to clear all unregistered data from the server symbol table

clip(→ arkouda.pdarrayclass.pdarray)

Clip (limit) the values in an array to a given range [lo,hi]

clz(→ pdarray)

Count leading zeros for each integer in an array.

coargsort(→ arkouda.pdarrayclass.pdarray)

Return the permutation that groups the rows (left-to-right), if the

coargsort(→ arkouda.pdarrayclass.pdarray)

Return the permutation that groups the rows (left-to-right), if the

coargsort(→ arkouda.pdarrayclass.pdarray)

Return the permutation that groups the rows (left-to-right), if the

compute_join_size(→ Tuple[int, int])

Compute the internal size of a hypothetical join between a and b. Returns

concatenate(→ Union[arkouda.pdarrayclass.pdarray, ...)

Concatenate a list or tuple of pdarray or Strings objects into

concatenate(→ Union[arkouda.pdarrayclass.pdarray, ...)

Concatenate a list or tuple of pdarray or Strings objects into

concatenate(→ Union[arkouda.pdarrayclass.pdarray, ...)

Concatenate a list or tuple of pdarray or Strings objects into

convert_if_categorical(values)

Convert a Categorical array to Strings for display

corr(→ numpy.float64)

Return the correlation between x and y

cos(→ arkouda.pdarrayclass.pdarray)

Return the element-wise cosine of the array.

cosh(→ arkouda.pdarrayclass.pdarray)

Return the element-wise hyperbolic cosine of the array.

count_nonzero(pda)

Compute the nonzero count of a given array. 1D case only, for now.

cov(→ numpy.float64)

Return the covariance of x and y

create_pdarray(→ pdarray)

Return a pdarray instance pointing to an array created by the arkouda server.

create_pdarray(→ pdarray)

Return a pdarray instance pointing to an array created by the arkouda server.

create_pdarray(→ pdarray)

Return a pdarray instance pointing to an array created by the arkouda server.

create_sparray(→ sparray)

Return a sparray instance pointing to an array created by the arkouda server.

create_sparse_matrix(→ arkouda.sparrayclass.sparray)

Create a sparse matrix from three pdarrays representing the row indices,

ctz(→ pdarray)

Count trailing zeros for each integer in an array.

cumprod(→ arkouda.pdarrayclass.pdarray)

Return the cumulative product over the array.

cumsum(→ arkouda.pdarrayclass.pdarray)

Return the cumulative sum over the array.

cumsum(→ arkouda.pdarrayclass.pdarray)

Return the cumulative sum over the array.

cumsum(→ arkouda.pdarrayclass.pdarray)

Return the cumulative sum over the array.

date_operators(cls)

date_range([start, end, periods, freq, tz, normalize, ...])

Creates a fixed frequency Datetime range. Alias for

date_range([start, end, periods, freq, tz, normalize, ...])

Creates a fixed frequency Datetime range. Alias for

deg2rad(→ arkouda.pdarrayclass.pdarray)

Converts angles element-wise from degrees to radians.

delete(→ arkouda.pdarrayclass.pdarray)

Return a copy of 'arr' with elements along the specified axis removed.

deprecate(*args, **kwargs)

Issues a DeprecationWarning, adds warning to old_name's

deprecate_with_doc(msg)

Deprecates a function and includes the deprecation in its docstring.

disableVerbose(→ None)

Disables verbose logging (DEBUG log level) for all ArkoudaLoggers, setting

disp(mesg[, device, linefeed])

Display a message on a device.

divmod(→ Tuple[pdarray, pdarray])

dot(→ Union[arkouda.numpy.dtypes.numpy_scalars, pdarray])

Returns the sum of the elementwise product of two arrays of the same size (the dot product) or

dtype(x)

enableVerbose(→ None)

Enables verbose logging (DEBUG log level) for all ArkoudaLoggers

exp(→ arkouda.pdarrayclass.pdarray)

Return the element-wise exponential of the array.

expm1(→ arkouda.pdarrayclass.pdarray)

Return the element-wise exponential of the array minus one.

export(read_path[, dataset_name, write_file, ...])

Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be

eye(rows, cols[, diag, dt])

Return a pdarray with zeros everywhere except along a diagonal, which is all ones.

find(query, space[, all_occurrences, remove_missing])

Return indices of query items in a search list of items.

flip(→ Union[arkouda.pdarrayclass.pdarray, ...)

Reverse an array's values along a particular axis or axes.

floor(→ arkouda.pdarrayclass.pdarray)

Return the element-wise floor of the array.

fmod(→ pdarray)

Returns the element-wise remainder of division.

format_float_positional(x[, precision, unique, ...])

Format a floating-point scalar as a decimal string in positional notation.

format_float_scientific(x[, precision, unique, trim, ...])

Format a floating-point scalar as a decimal string in scientific notation.

gen_ranges(starts, ends[, stride, return_lengths])

Generate a segmented array of variable-length, contiguous ranges between pairs of

gen_ranges(starts, ends[, stride, return_lengths])

Generate a segmented array of variable-length, contiguous ranges between pairs of

generic_concat(items[, ordered])

getArkoudaLogger(→ ArkoudaLogger)

A convenience method for instantiating an ArkoudaLogger that retrieves the

get_byteorder(→ str)

Get a concrete byteorder (turns '=' into '<' or '>')

get_callback(x)

get_columns(→ List[str])

Get a list of column names from CSV file(s).

get_datasets(→ List[str])

Get the names of the datasets in the provide files

get_filetype(→ str)

Get the type of a file accessible to the server. Supported

get_null_indices(→ Union[arkouda.pdarrayclass.pdarray, ...)

Get null indices of a string column in a Parquet file.

get_server_byteorder(→ str)

Get the server's byteorder

hash(→ Union[Tuple[arkouda.pdarrayclass.pdarray, ...)

Return an element-wise hash of the array or list of arrays.

hist_all(ak_df[, cols])

Create a grid plot histogramming all numeric columns in ak dataframe

histogram(→ Tuple[arkouda.pdarrayclass.pdarray, ...)

Compute a histogram of evenly spaced bins over the range of an array.

histogram(→ Tuple[arkouda.pdarrayclass.pdarray, ...)

Compute a histogram of evenly spaced bins over the range of an array.

histogram2d(→ Tuple[arkouda.pdarrayclass.pdarray, ...)

Compute the bi-dimensional histogram of two data samples with evenly spaced bins

histogramdd(→ Tuple[arkouda.pdarrayclass.pdarray, ...)

Compute the multidimensional histogram of data in sample with evenly spaced bins.

import_data(read_path[, write_file, return_obj, index])

Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or

in1d(→ arkouda.groupbyclass.groupable)

Test whether each element of a 1-D array is also present in a second array.

in1d(→ arkouda.groupbyclass.groupable)

Test whether each element of a 1-D array is also present in a second array.

in1d(→ arkouda.groupbyclass.groupable)

Test whether each element of a 1-D array is also present in a second array.

in1d_intervals(vals, intervals[, symmetric])

Test each value for membership in any of a set of half-open (pythonic)

indexof1d(→ arkouda.pdarrayclass.pdarray)

Return indices of query items in a search list of items. Items not found will be excluded.

information(→ str)

Returns JSON formatted string containing information about the objects in names

intersect(a, b[, positions, unique])

Find the intersection of two arkouda arrays.

intersect1d(→ Union[arkouda.pdarrayclass.pdarray, ...)

Find the intersection of two arrays.

interval_lookup(keys, values, arguments[, fillvalue, ...])

Apply a function defined over intervals to an array of arguments.

intx(a, b)

Find all the rows that are in both dataframes.

invert_permutation(perm)

Find the inverse of a permutation array.

ip_address(values)

Convert values to an Arkouda array of IP addresses.

isSupportedBool(num)

isSupportedDType(scalar)

isSupportedFloat(num)

isSupportedInt(num)

isSupportedInt(num)

isSupportedInt(num)

isSupportedInt(num)

isSupportedNumber(num)

is_cosorted(arrays)

Return True iff the arrays are cosorted, i.e., if the arrays were columns in a table

is_ipv4(→ arkouda.pdarrayclass.pdarray)

Indicate which values are ipv4 when passed data containing IPv4 and IPv6 values.

is_ipv6(→ arkouda.pdarrayclass.pdarray)

Indicate which values are ipv6 when passed data containing IPv4 and IPv6 values.

is_registered(→ bool)

Determine if the name provided is associated with a registered Object

isfinite(→ arkouda.pdarrayclass.pdarray)

Return the element-wise isfinite check applied to the array.

isinf(→ arkouda.pdarrayclass.pdarray)

Return the element-wise isinf check applied to the array.

isnan(→ arkouda.pdarrayclass.pdarray)

Return the element-wise isnan check applied to the array.

isnan(→ arkouda.pdarrayclass.pdarray)

Return the element-wise isnan check applied to the array.

isscalar(element)

Returns True if the type of element is a scalar type.

issctype(rep)

Determines whether the given object represents a scalar data-type.

issubclass_(arg1, arg2)

Determine if a class is a subclass of a second class.

issubdtype(arg1, arg2)

Returns True if first argument is a typecode lower/equal in type hierarchy.

join_on_eq_with_dt(...)

Performs an inner-join on equality between two integer arrays where

left_align(left, right)

Map two arrays of sparse identifiers to the 0-up index set implied by the left array,

list_registry([detailed])

Return a list containing the names of all registered objects

list_symbol_table(→ List[str])

Return a list containing the names of all objects in the symbol table

load(→ Union[Mapping[str, ...)

Load a pdarray previously saved with pdarray.save().

load_all(→ Mapping[str, ...)

Load multiple pdarrays, Strings, SegArrays, or Categoricals previously

log(→ arkouda.pdarrayclass.pdarray)

Return the element-wise natural log of the array.

log10(→ arkouda.pdarrayclass.pdarray)

Return the element-wise base 10 log of the array.

log1p(→ arkouda.pdarrayclass.pdarray)

Return the element-wise natural log of one plus the array.

log2(→ arkouda.pdarrayclass.pdarray)

Return the element-wise base 2 log of the array.

lookup(keys, values, arguments[, fillvalue])

Apply the function defined by the mapping keys --> values to arguments.

ls(→ List[str])

This function calls the h5ls utility on a HDF5 file visible to the

ls_csv(→ List[str])

Used for identifying the datasets within a file when a CSV does not

matmul(pdaLeft, pdaRight)

Compute the product of two matrices.

maximum_sctype(t)

Return the scalar type of highest precision of the same kind as the input.

maxk(→ pdarray)

Find the k maximum values of an array.

mean(→ numpy.float64)

Return the mean of the array.

median(pda)

Compute the median of a given array. 1d case only, for now.

merge(→ DataFrame)

Merge Arkouda DataFrames with a database-style join.

mink(→ pdarray)

Find the k minimum values of an array.

mod(→ pdarray)

Returns the element-wise remainder of division.

parity(→ pdarray)

Find the bit parity (XOR of all bits) for each integer in an array.

plot_dist(b, h[, log, xlabel, newfig])

Plot the distribution and cumulative distribution of histogram Data

popcount(→ pdarray)

Find the population (number of bits set) for each integer in an array.

power(→ pdarray)

Raises an array to a power. If where is given, the operation will only take place in the positions

power_divergence(f_obs[, f_exp, ddof, lambda_])

Computes the power divergence statistic and p-value.

pretty_print_information(→ None)

Prints verbose information for each object in names in a human readable format

putmask(A, mask, Values)

Overwrites elements of A with elements from B based upon a mask array.

rad2deg(→ arkouda.pdarrayclass.pdarray)

Converts angles element-wise from radians to degrees.

random_sparse_matrix(→ arkouda.sparrayclass.sparray)

Create a random sparse matrix with the specified number of rows and columns

read(→ Union[Mapping[str, ...)

Read datasets from files.

read_csv(→ Union[Mapping[str, ...)

Read CSV file(s) into Arkouda objects. If more than one dataset is found, the objects

read_hdf(→ Union[Mapping[str, ...)

Read Arkouda objects from HDF5 file/s

read_parquet(→ Union[Mapping[str, ...)

Read Arkouda objects from Parquet file/s

read_tagged_data(filenames[, datasets, strictTypes, ...])

Read datasets from files and tag each record to the file it was read from.

read_zarr(store_path, ndim, dtype)

Reads a Zarr store from disk into a pdarray. Supports multi-dimensional pdarrays of numeric types.

receive(hostname, port)

Receive a pdarray sent by pdarray.transfer().

receive_dataframe(hostname, port)

Receive a pdarray sent by dataframe.transfer().

register_all(data)

Register all objects in the provided dictionary

resolve_scalar_dtype(→ str)

Try to infer what dtype arkouda_server should treat val as.

restore(filename)

Return data saved using ak.snapshot

right_align(left, right)

Map two arrays of sparse values to the 0-up index set implied by the right array,

rotl(→ pdarray)

Rotate bits of <x> to the left by <rot>.

rotr(→ pdarray)

Rotate bits of <x> to the left by <rot>.

round(→ arkouda.pdarrayclass.pdarray)

Return the element-wise rounding of the array.

save_all(→ None)

DEPRECATED

search_intervals(vals, intervals[, tiebreak, hierarchical])

Given an array of query vals and non-overlapping, closed intervals, return

segarray(segments, values[, lengths, grouping])

Alias for the from_parts function. Prevents user from needing to call ak.SegArray constructor

setdiff1d(→ Union[arkouda.pdarrayclass.pdarray, ...)

Find the set difference of two arrays.

setxor1d(→ Union[arkouda.pdarrayclass.pdarray, ...)

Find the set exclusive-or (symmetric difference) of two arrays.

shape(→ Tuple)

Return the shape of an array.

sign(→ arkouda.pdarrayclass.pdarray)

Return the element-wise sign of the array.

sin(→ arkouda.pdarrayclass.pdarray)

Return the element-wise sine of the array.

sinh(→ arkouda.pdarrayclass.pdarray)

Return the element-wise hyperbolic sine of the array.

skew(→ numpy.float64)

Computes the sample skewness of an array.

snapshot(filename)

Create a snapshot of the current Arkouda namespace. All currently accessible variables containing

sort(→ arkouda.pdarrayclass.pdarray)

Return a sorted copy of the array. Only sorts numeric arrays;

sparse_matrix_matrix_mult(→ arkouda.sparrayclass.sparray)

Multiply two sparse matrices.

sqrt(→ pdarray)

Takes the square root of array. If where is given, the operation will only take place in

square(→ arkouda.pdarrayclass.pdarray)

Return the element-wise square of the array.

squeeze(→ arkouda.pdarrayclass.pdarray)

Remove degenerate (size one) dimensions from an array.

std(→ numpy.float64)

Return the standard deviation of values in the array. The standard

string_operators(cls)

tan(→ arkouda.pdarrayclass.pdarray)

Return the element-wise tangent of the array.

tanh(→ arkouda.pdarrayclass.pdarray)

Return the element-wise hyperbolic tangent of the array.

timedelta_range([start, end, periods, freq, name, closed])

Return a fixed frequency TimedeltaIndex, with day as the default

timedelta_range([start, end, periods, freq, name, closed])

Return a fixed frequency TimedeltaIndex, with day as the default

to_csv(columns, prefix_path[, names, col_delim, overwrite])

Write Arkouda object(s) to CSV file(s). All CSV Files written by Arkouda

to_hdf(→ None)

Save multiple named pdarrays to HDF5 files.

to_parquet(→ None)

Save multiple named pdarrays to Parquet files.

to_zarr(store_path, arr, chunk_shape)

Writes a pdarray to disk as a Zarr store. Supports multi-dimensional pdarrays of numeric types.

transpose(pda)

Compute the transpose of a matrix.

tril(pda[, diag])

Return a copy of the pda with the upper triangle zeroed out

triu(pda[, diag])

Return a copy of the pda with the lower triangle zeroed out

trunc(→ arkouda.pdarrayclass.pdarray)

Return the element-wise truncation of the array.

typename(char)

Return a description for the given data type code.

union1d(→ arkouda.groupbyclass.groupable)

Find the union of two arrays/List of Arrays.

unique(→ Union[groupable, Tuple[groupable, pdarray, ...)

Find the unique elements of an array.

unique(→ Union[groupable, Tuple[groupable, pdarray, ...)

Find the unique elements of an array.

unique(→ Union[groupable, Tuple[groupable, pdarray, ...)

Find the unique elements of an array.

unregister(→ str)

unregister_all(names)

Unregister all names provided

unregister_pdarray_by_name(→ None)

Unregister a named pdarray in the arkouda server which was previously

unsqueeze(p)

update_hdf(columns, prefix_path[, names, repack])

Overwrite the datasets with name appearing in names or keys in columns if columns

value_counts(→ tuple)

Count the occurrences of the unique values of an array.

var(→ numpy.float64)

Return the variance of values in the array.

vecdot(x1, x2)

Compute the generalized dot product of two vectors along the given axis.

vstack(→ arkouda.pdarrayclass.pdarray)

Stack a sequence of arrays vertically (row-wise).

where(→ Union[arkouda.pdarrayclass.pdarray, ...)

Returns an array with elements chosen from A and B based upon a

where(→ Union[arkouda.pdarrayclass.pdarray, ...)

Returns an array with elements chosen from A and B based upon a

where(→ Union[arkouda.pdarrayclass.pdarray, ...)

Returns an array with elements chosen from A and B based upon a

write_log(log_msg[, tag, log_lvl])

Allows the user to write custom logs.

xlogy(x, y)

Computes x * log(y).

zero_up(vals)

Map an array of sparse values to 0-up indices.

Package Contents

class arkouda.ARKOUDA_SUPPORTED_DTYPES

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.ARKOUDA_SUPPORTED_INTS

Built-in immutable sequence.

If no argument is given, the constructor returns an empty tuple. If iterable is specified the tuple is initialized from iterable’s items.

If the argument is a tuple, the return value is the same object.

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

arkouda.AllSymbols = '__AllSymbols__'
class arkouda.BitVector(values, width=64, reverse=False)[source]

Bases: arkouda.pdarrayclass.pdarray

Represent integers as bit vectors, e.g. a set of flags.

Parameters:
  • values (pdarray, int64) – The integers to represent as bit vectors

  • width (int) – The number of bit fields in the vector

  • reverse (bool) – If True, display bits from least significant (left) to most significant (right). By default, the most significant bit is the left-most bit.

Returns:

bitvectors – The array of binary vectors

Return type:

BitVector

Notes

This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like a uint64 pdarray.

conserves
format(x)[source]

Format a single binary vector as a string.

classmethod from_return_msg(rep_msg)[source]
opeq(other, op)[source]
register(user_defined_name)[source]

Register this BitVector object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the BitVector is to be registered under, this will be the root name for underlying components

Returns:

The same BitVector which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different BitVectors with the same name.

Return type:

BitVector

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the BitVector with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name = None
reverse = False
special_objType = 'BitVector'
to_list()[source]

Export data to a list of string-formatted bit vectors.

to_ndarray()[source]

Export data to a numpy array of string-formatted bit vectors.

values
width = 64
arkouda.BitVectorizer(width=64, reverse=False)[source]

Make a callback (i.e. function) that can be called on an array to create a BitVector.

Parameters:
  • width (int) – The number of bit fields in the vector

  • reverse (bool) – If True, display bits from least significant (left) to most significant (right). By default, the most significant bit is the left-most bit.

Returns:

bitvectorizer – A function that takes an array and returns a BitVector instance

Return type:

callable

class arkouda.BoolDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.ByteDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.BytesDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.CLongDoubleDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.CachedAccessor(name: str, accessor)[source]

Custom property-like object. A descriptor for caching accessors. :param name: Namespace that will be accessed under, e.g. df.foo. :type name: str :param accessor: Class with the extension methods. :type accessor: cls

Notes

For accessor, The class’s __init__ method assumes that one of Series, DataFrame or Index as the single argument data.

class arkouda.Categorical(values, **kwargs)[source]

Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.

Parameters:
  • values (Strings, Categorical, pd.Categorical) – Values to convert to categories

  • NAvalue (str scalar) – The value to use to represent missing/null data

categories

The set of category labels (determined automatically)

Type:

Strings

codes

The category indices of the values or -1 for N/A

Type:

pdarray, int64

permutation

The permutation that groups the values in the same order as categories

Type:

pdarray, int64

segments

When values are grouped, the starting offset of each group

Type:

pdarray, int64

size

The number of items in the array

Type:

Union[int,np.int64]

nlevels

The number of distinct categories

Type:

Union[int,np.int64]

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

Union[int,np.int64]

shape

The sizes of each dimension of the array

Type:

tuple

BinOps
RegisterablePieces
RequiredPieces
argsort()[source]
static attach(user_defined_name: str) Categorical[source]

DEPRECATED Function to return a Categorical object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which Categorical object was registered under

Returns:

The Categorical object created by re-attaching to the corresponding server components

Return type:

Categorical

Raises:

TypeError – if user_defined_name is not a string

concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical[source]

Merge this Categorical with other Categorical objects in the array, concatenating the arrays and synchronizing the categories.

Parameters:
  • others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one

  • ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.

Returns:

The merged Categorical object

Return type:

Categorical

Raises:

TypeError – Raised if any others array objects are not Categorical objects

Notes

This operation can be expensive – slower than concatenating Strings.

contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

dtype
endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Categoricals are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Categoricals are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> c = Categorical(ak.array(["a", "b", "c"]))
>>> c_cpy = Categorical(ak.array(["a", "b", "c"]))
>>> c.equals(c_cpy)
True
>>> c2 = Categorical(ak.array(["a", "x", "c"]))
>>> c.equals(c2)
False
classmethod from_codes(codes: arkouda.pdarrayclass.pdarray, categories: arkouda.strings.Strings, permutation=None, segments=None, **kwargs) Categorical[source]

Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.

Parameters:
  • codes (pdarray, int64) – Category indices of each value

  • categories (Strings) – Unique category labels

  • permutation (pdarray, int64) – The permutation that groups the values in the same order as categories

  • segments (pdarray, int64) – When values are grouped, the starting offset of each group

Returns:

The Categorical object created from the input parameters

Return type:

Categorical

Raises:

TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object

classmethod from_return_msg(rep_msg) Categorical[source]

Create categorical from return message from server

Notes

This is currently only used when reading a Categorical from HDF5 files.

group() arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent categories together. All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.

hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each element of the Categorical.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

in1d(test: arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray[source]

Test whether each element of the Categorical object is also present in the test Strings or Categorical object.

Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.

Parameters:

test (Union[Strings,Categorical]) – The values against which to test each value of ‘self`.

Returns:

The values self[in1d] are in the test Strings or Categorical object.

Return type:

pdarray, bool

Raises:

TypeError – Raised if test is not a Strings or Categorical object

Notes

in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences. in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.

Examples

>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True, True, True, True, True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False, False, False, False, False])
property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isna()[source]

Find where values are missing or null (as defined by self.NAvalue)

logger
property nbytes

The size of the Categorical in bytes.

Returns:

The size of the Categorical in bytes.

Return type:

int

ndim
nlevels
objType = 'Categorical'
static parse_hdf_categoricals(d: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings]) Tuple[List[str], Dict[str, Categorical]][source]

This function should be used in conjunction with the load_all function which reads hdf5 files and reconstitutes Categorical objects. Categorical objects use a naming convention and HDF5 structure so they can be identified and constructed for the user.

In general you should not call this method directly

Parameters:

d (Dictionary of String to either Pdarray or Strings object)

Returns:

  • 2-Tuple of List of strings containing key names which should be removed and Dictionary of

  • base name to Categorical object

permutation: arkouda.pdarrayclass.pdarray | None = None
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

register(user_defined_name: str) Categorical[source]

Register this Categorical object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Categorical is to be registered under, this will be the root name for underlying components

Returns:

The same Categorical which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Categoricals with the same name.

Return type:

Categorical

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Categorical with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
reset_categories() Categorical[source]

Recompute the category labels, discarding any unused labels. This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.

Returns:

A Categorical object generated from the current instance

Return type:

Categorical

save(prefix_path: str, dataset: str = 'categorical_array', file_format: str = 'HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) str[source]

DEPRECATED Save the Categorical object to HDF5 or Parquet. The result is a collection of HDF5/Parquet files, one file per locale of the arkouda server, where each filename starts with prefix_path and dataset. Each locale saves its chunk of the Strings array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in HDF5 files (must not already exist) :type dataset: str :param file_format: The format to save the file to. :type file_format: str {‘HDF5 | ‘Parquet’} :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, create a new Categorical dataset within existing files.

Parameters:
  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

  • compression (str (Optional)) – {None | ‘snappy’ | ‘gzip’ | ‘brotli’ | ‘zstd’ | ‘lz4’} The compression type to use when writing. This is only supported for Parquet files and will not be used with HDF5.

Return type:

String message indicating result of save operation

Raises:
  • ValueError – Raised if the lengths of columns and values differ, or the mode is neither ‘truncate’ nor ‘append’

  • TypeError – Raised if prefix_path, dataset, or mode is not a str

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter.

See also

-, -

segments = None
set_categories(new_categories, NAvalue=None)[source]

Set categories to user-defined values.

Parameters:
  • new_categories (Strings) – The array of new categories to use. Must be unique.

  • NAvalue (str scalar) – The value to use to represent missing/null data

Returns:

A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.

Return type:

Categorical

shape
size: arkouda.numpy.dtypes.int_scalars
sort_values()[source]
classmethod standardize_categories(arrays, NAvalue='N/A')[source]

Standardize an array of Categoricals so that they share the same categories.

Parameters:
  • arrays (sequence of Categoricals) – The Categoricals to standardize

  • NAvalue (str scalar) – The value to use to represent missing/null data

Returns:

A list of the original Categoricals remapped to the shared categories.

Return type:

List of Categoricals

startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

to_hdf(prefix_path, dataset='categorical_array', mode='truncate', file_type='distribute')[source]

Save the Categorical to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale.

Return type:

None

See also

load

to_list() List[source]

Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the arrays exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list of strings corresponding to the values in this Categorical

Return type:

list

Notes

The number of bytes in the Categorical cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the arrays exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray of strings corresponding to the values in this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

to_pandas() pandas.Categorical[source]

Return the equivalent Pandas Categorical.

to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str[source]

This functionality is currently not supported and will also raise a RuntimeError. Support is in development. Save the Categorical to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Categorical dataset within existing files.

  • compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4

Return type:

String message indicating result of save operation

Raises:

RuntimeError – On run due to compatability issues of Categorical with Parquet.

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf

to_strings() List[source]

Convert the Categorical to Strings.

Returns:

A Strings object corresponding to the values in this Categorical.

Return type:

arkouda.strings.Strings

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array(["a","b","c"])
>>> a
>>> c = ak.Categorical(a)
>>>  c.to_strings()
c.to_strings()
>>> isinstance(c.to_strings(), ak.Strings)
True
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Categorical object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unique() Categorical[source]
unregister() None[source]

Unregister this Categorical object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

static unregister_categorical_by_name(user_defined_name: str) None[source]

Function to unregister Categorical object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the Categorical object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path, dataset='categorical_array', repack=True)[source]

Overwrite the dataset with the name provided with this Categorical object. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Categorical

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.

class arkouda.Categorical(values, **kwargs)[source]

Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.

Parameters:
  • values (Strings, Categorical, pd.Categorical) – Values to convert to categories

  • NAvalue (str scalar) – The value to use to represent missing/null data

categories

The set of category labels (determined automatically)

Type:

Strings

codes

The category indices of the values or -1 for N/A

Type:

pdarray, int64

permutation

The permutation that groups the values in the same order as categories

Type:

pdarray, int64

segments

When values are grouped, the starting offset of each group

Type:

pdarray, int64

size

The number of items in the array

Type:

Union[int,np.int64]

nlevels

The number of distinct categories

Type:

Union[int,np.int64]

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

Union[int,np.int64]

shape

The sizes of each dimension of the array

Type:

tuple

BinOps
RegisterablePieces
RequiredPieces
argsort()[source]
static attach(user_defined_name: str) Categorical[source]

DEPRECATED Function to return a Categorical object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which Categorical object was registered under

Returns:

The Categorical object created by re-attaching to the corresponding server components

Return type:

Categorical

Raises:

TypeError – if user_defined_name is not a string

concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical[source]

Merge this Categorical with other Categorical objects in the array, concatenating the arrays and synchronizing the categories.

Parameters:
  • others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one

  • ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.

Returns:

The merged Categorical object

Return type:

Categorical

Raises:

TypeError – Raised if any others array objects are not Categorical objects

Notes

This operation can be expensive – slower than concatenating Strings.

contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

dtype
endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Categoricals are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Categoricals are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> c = Categorical(ak.array(["a", "b", "c"]))
>>> c_cpy = Categorical(ak.array(["a", "b", "c"]))
>>> c.equals(c_cpy)
True
>>> c2 = Categorical(ak.array(["a", "x", "c"]))
>>> c.equals(c2)
False
classmethod from_codes(codes: arkouda.pdarrayclass.pdarray, categories: arkouda.strings.Strings, permutation=None, segments=None, **kwargs) Categorical[source]

Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.

Parameters:
  • codes (pdarray, int64) – Category indices of each value

  • categories (Strings) – Unique category labels

  • permutation (pdarray, int64) – The permutation that groups the values in the same order as categories

  • segments (pdarray, int64) – When values are grouped, the starting offset of each group

Returns:

The Categorical object created from the input parameters

Return type:

Categorical

Raises:

TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object

classmethod from_return_msg(rep_msg) Categorical[source]

Create categorical from return message from server

Notes

This is currently only used when reading a Categorical from HDF5 files.

group() arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent categories together. All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.

hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each element of the Categorical.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

in1d(test: arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray[source]

Test whether each element of the Categorical object is also present in the test Strings or Categorical object.

Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.

Parameters:

test (Union[Strings,Categorical]) – The values against which to test each value of ‘self`.

Returns:

The values self[in1d] are in the test Strings or Categorical object.

Return type:

pdarray, bool

Raises:

TypeError – Raised if test is not a Strings or Categorical object

Notes

in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences. in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.

Examples

>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True, True, True, True, True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False, False, False, False, False])
property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isna()[source]

Find where values are missing or null (as defined by self.NAvalue)

logger
property nbytes

The size of the Categorical in bytes.

Returns:

The size of the Categorical in bytes.

Return type:

int

ndim
nlevels
objType = 'Categorical'
static parse_hdf_categoricals(d: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings]) Tuple[List[str], Dict[str, Categorical]][source]

This function should be used in conjunction with the load_all function which reads hdf5 files and reconstitutes Categorical objects. Categorical objects use a naming convention and HDF5 structure so they can be identified and constructed for the user.

In general you should not call this method directly

Parameters:

d (Dictionary of String to either Pdarray or Strings object)

Returns:

  • 2-Tuple of List of strings containing key names which should be removed and Dictionary of

  • base name to Categorical object

permutation: arkouda.pdarrayclass.pdarray | None = None
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

register(user_defined_name: str) Categorical[source]

Register this Categorical object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Categorical is to be registered under, this will be the root name for underlying components

Returns:

The same Categorical which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Categoricals with the same name.

Return type:

Categorical

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Categorical with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
reset_categories() Categorical[source]

Recompute the category labels, discarding any unused labels. This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.

Returns:

A Categorical object generated from the current instance

Return type:

Categorical

save(prefix_path: str, dataset: str = 'categorical_array', file_format: str = 'HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) str[source]

DEPRECATED Save the Categorical object to HDF5 or Parquet. The result is a collection of HDF5/Parquet files, one file per locale of the arkouda server, where each filename starts with prefix_path and dataset. Each locale saves its chunk of the Strings array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in HDF5 files (must not already exist) :type dataset: str :param file_format: The format to save the file to. :type file_format: str {‘HDF5 | ‘Parquet’} :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, create a new Categorical dataset within existing files.

Parameters:
  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

  • compression (str (Optional)) – {None | ‘snappy’ | ‘gzip’ | ‘brotli’ | ‘zstd’ | ‘lz4’} The compression type to use when writing. This is only supported for Parquet files and will not be used with HDF5.

Return type:

String message indicating result of save operation

Raises:
  • ValueError – Raised if the lengths of columns and values differ, or the mode is neither ‘truncate’ nor ‘append’

  • TypeError – Raised if prefix_path, dataset, or mode is not a str

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter.

See also

-, -

segments = None
set_categories(new_categories, NAvalue=None)[source]

Set categories to user-defined values.

Parameters:
  • new_categories (Strings) – The array of new categories to use. Must be unique.

  • NAvalue (str scalar) – The value to use to represent missing/null data

Returns:

A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.

Return type:

Categorical

shape
size: arkouda.numpy.dtypes.int_scalars
sort_values()[source]
classmethod standardize_categories(arrays, NAvalue='N/A')[source]

Standardize an array of Categoricals so that they share the same categories.

Parameters:
  • arrays (sequence of Categoricals) – The Categoricals to standardize

  • NAvalue (str scalar) – The value to use to represent missing/null data

Returns:

A list of the original Categoricals remapped to the shared categories.

Return type:

List of Categoricals

startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

to_hdf(prefix_path, dataset='categorical_array', mode='truncate', file_type='distribute')[source]

Save the Categorical to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale.

Return type:

None

See also

load

to_list() List[source]

Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the arrays exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list of strings corresponding to the values in this Categorical

Return type:

list

Notes

The number of bytes in the Categorical cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the arrays exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray of strings corresponding to the values in this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

to_pandas() pandas.Categorical[source]

Return the equivalent Pandas Categorical.

to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str[source]

This functionality is currently not supported and will also raise a RuntimeError. Support is in development. Save the Categorical to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Categorical dataset within existing files.

  • compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4

Return type:

String message indicating result of save operation

Raises:

RuntimeError – On run due to compatability issues of Categorical with Parquet.

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf

to_strings() List[source]

Convert the Categorical to Strings.

Returns:

A Strings object corresponding to the values in this Categorical.

Return type:

arkouda.strings.Strings

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array(["a","b","c"])
>>> a
>>> c = ak.Categorical(a)
>>>  c.to_strings()
c.to_strings()
>>> isinstance(c.to_strings(), ak.Strings)
True
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Categorical object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unique() Categorical[source]
unregister() None[source]

Unregister this Categorical object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

static unregister_categorical_by_name(user_defined_name: str) None[source]

Function to unregister Categorical object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the Categorical object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path, dataset='categorical_array', repack=True)[source]

Overwrite the dataset with the name provided with this Categorical object. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Categorical

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.

class arkouda.Categorical(values, **kwargs)[source]

Represents an array of values belonging to named categories. Converting a Strings object to Categorical often saves memory and speeds up operations, especially if there are many repeated values, at the cost of some one-time work in initialization.

Parameters:
  • values (Strings, Categorical, pd.Categorical) – Values to convert to categories

  • NAvalue (str scalar) – The value to use to represent missing/null data

categories

The set of category labels (determined automatically)

Type:

Strings

codes

The category indices of the values or -1 for N/A

Type:

pdarray, int64

permutation

The permutation that groups the values in the same order as categories

Type:

pdarray, int64

segments

When values are grouped, the starting offset of each group

Type:

pdarray, int64

size

The number of items in the array

Type:

Union[int,np.int64]

nlevels

The number of distinct categories

Type:

Union[int,np.int64]

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

Union[int,np.int64]

shape

The sizes of each dimension of the array

Type:

tuple

BinOps
RegisterablePieces
RequiredPieces
argsort()[source]
static attach(user_defined_name: str) Categorical[source]

DEPRECATED Function to return a Categorical object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which Categorical object was registered under

Returns:

The Categorical object created by re-attaching to the corresponding server components

Return type:

Categorical

Raises:

TypeError – if user_defined_name is not a string

concatenate(others: Sequence[Categorical], ordered: bool = True) Categorical[source]

Merge this Categorical with other Categorical objects in the array, concatenating the arrays and synchronizing the categories.

Parameters:
  • others (Sequence[Categorical]) – The Categorical arrays to concatenate and merge with this one

  • ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.

Returns:

The merged Categorical object

Return type:

Categorical

Raises:

TypeError – Raised if any others array objects are not Categorical objects

Notes

This operation can be expensive – slower than concatenating Strings.

contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

dtype
endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Categoricals are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Categoricals are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> c = Categorical(ak.array(["a", "b", "c"]))
>>> c_cpy = Categorical(ak.array(["a", "b", "c"]))
>>> c.equals(c_cpy)
True
>>> c2 = Categorical(ak.array(["a", "x", "c"]))
>>> c.equals(c2)
False
classmethod from_codes(codes: arkouda.pdarrayclass.pdarray, categories: arkouda.strings.Strings, permutation=None, segments=None, **kwargs) Categorical[source]

Make a Categorical from codes and categories arrays. If codes and categories have already been pre-computed, this constructor saves time. If not, please use the normal constructor.

Parameters:
  • codes (pdarray, int64) – Category indices of each value

  • categories (Strings) – Unique category labels

  • permutation (pdarray, int64) – The permutation that groups the values in the same order as categories

  • segments (pdarray, int64) – When values are grouped, the starting offset of each group

Returns:

The Categorical object created from the input parameters

Return type:

Categorical

Raises:

TypeError – Raised if codes is not a pdarray of int64 objects or if categories is not a Strings object

classmethod from_return_msg(rep_msg) Categorical[source]

Create categorical from return message from server

Notes

This is currently only used when reading a Categorical from HDF5 files.

group() arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent categories together. All instances of the same category are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

This method is faster than the corresponding Strings method. If the Categorical was created from a Strings object, then this function simply returns the cached permutation. Even if the Categorical was created using from_codes(), this function will be faster than Strings.group() because it sorts dense integer values, rather than 128-bit hash values.

hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each element of the Categorical.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

in1d(test: arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray[source]

Test whether each element of the Categorical object is also present in the test Strings or Categorical object.

Returns a boolean array the same length as self that is True where an element of self is in test and False otherwise.

Parameters:

test (Union[Strings,Categorical]) – The values against which to test each value of ‘self`.

Returns:

The values self[in1d] are in the test Strings or Categorical object.

Return type:

pdarray, bool

Raises:

TypeError – Raised if test is not a Strings or Categorical object

Notes

in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences. in1d(a, b) is logically equivalent to ak.array([item in b for item in a]), but is much faster and scales to arbitrarily large a.

Examples

>>> strings = ak.array([f'String {i}' for i in range(0,5)])
>>> cat = ak.Categorical(strings)
>>> ak.in1d(cat,strings)
array([True, True, True, True, True])
>>> strings = ak.array([f'String {i}' for i in range(5,9)])
>>> catTwo = ak.Categorical(strings)
>>> ak.in1d(cat,catTwo)
array([False, False, False, False, False])
property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isna()[source]

Find where values are missing or null (as defined by self.NAvalue)

logger
property nbytes

The size of the Categorical in bytes.

Returns:

The size of the Categorical in bytes.

Return type:

int

ndim
nlevels
objType = 'Categorical'
static parse_hdf_categoricals(d: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings]) Tuple[List[str], Dict[str, Categorical]][source]

This function should be used in conjunction with the load_all function which reads hdf5 files and reconstitutes Categorical objects. Categorical objects use a naming convention and HDF5 structure so they can be identified and constructed for the user.

In general you should not call this method directly

Parameters:

d (Dictionary of String to either Pdarray or Strings object)

Returns:

  • 2-Tuple of List of strings containing key names which should be removed and Dictionary of

  • base name to Categorical object

permutation: arkouda.pdarrayclass.pdarray | None = None
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

register(user_defined_name: str) Categorical[source]

Register this Categorical object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Categorical is to be registered under, this will be the root name for underlying components

Returns:

The same Categorical which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Categoricals with the same name.

Return type:

Categorical

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Categorical with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
reset_categories() Categorical[source]

Recompute the category labels, discarding any unused labels. This method is often useful after slicing or indexing a Categorical array, when the resulting array only contains a subset of the original categories. In this case, eliminating unused categories can speed up other operations.

Returns:

A Categorical object generated from the current instance

Return type:

Categorical

save(prefix_path: str, dataset: str = 'categorical_array', file_format: str = 'HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) str[source]

DEPRECATED Save the Categorical object to HDF5 or Parquet. The result is a collection of HDF5/Parquet files, one file per locale of the arkouda server, where each filename starts with prefix_path and dataset. Each locale saves its chunk of the Strings array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in HDF5 files (must not already exist) :type dataset: str :param file_format: The format to save the file to. :type file_format: str {‘HDF5 | ‘Parquet’} :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, create a new Categorical dataset within existing files.

Parameters:
  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

  • compression (str (Optional)) – {None | ‘snappy’ | ‘gzip’ | ‘brotli’ | ‘zstd’ | ‘lz4’} The compression type to use when writing. This is only supported for Parquet files and will not be used with HDF5.

Return type:

String message indicating result of save operation

Raises:
  • ValueError – Raised if the lengths of columns and values differ, or the mode is neither ‘truncate’ nor ‘append’

  • TypeError – Raised if prefix_path, dataset, or mode is not a str

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter.

See also

-, -

segments = None
set_categories(new_categories, NAvalue=None)[source]

Set categories to user-defined values.

Parameters:
  • new_categories (Strings) – The array of new categories to use. Must be unique.

  • NAvalue (str scalar) – The value to use to represent missing/null data

Returns:

A new Categorical with the user-defined categories. Old values present in new categories will appear unchanged. Old values not present will be assigned the NA value.

Return type:

Categorical

shape
size: arkouda.numpy.dtypes.int_scalars
sort_values()[source]
classmethod standardize_categories(arrays, NAvalue='N/A')[source]

Standardize an array of Categoricals so that they share the same categories.

Parameters:
  • arrays (sequence of Categoricals) – The Categoricals to standardize

  • NAvalue (str scalar) – The value to use to represent missing/null data

Returns:

A list of the original Categoricals remapped to the shared categories.

Return type:

List of Categoricals

startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The substring to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Notes

This method can be significantly faster than the corresponding method on Strings objects, because it searches the unique category labels instead of the full array.

to_hdf(prefix_path, dataset='categorical_array', mode='truncate', file_type='distribute')[source]

Save the Categorical to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale.

Return type:

None

See also

load

to_list() List[source]

Convert the Categorical to a list, transferring data from the arkouda server to Python. This conversion discards category information and produces a list of strings. If the arrays exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list of strings corresponding to the values in this Categorical

Return type:

list

Notes

The number of bytes in the Categorical cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. This conversion discards category information and produces an ndarray of strings. If the arrays exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray of strings corresponding to the values in this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

to_pandas() pandas.Categorical[source]

Return the equivalent Pandas Categorical.

to_parquet(prefix_path: str, dataset: str = 'categorical_array', mode: str = 'truncate', compression: str | None = None) str[source]

This functionality is currently not supported and will also raise a RuntimeError. Support is in development. Save the Categorical to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in HDF5 files (must not already exist)

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Categorical dataset within existing files.

  • compression (str (Optional)) – Default None Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4

Return type:

String message indicating result of save operation

Raises:

RuntimeError – On run due to compatability issues of Categorical with Parquet.

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf

to_strings() List[source]

Convert the Categorical to Strings.

Returns:

A Strings object corresponding to the values in this Categorical.

Return type:

arkouda.strings.Strings

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array(["a","b","c"])
>>> a
>>> c = ak.Categorical(a)
>>>  c.to_strings()
c.to_strings()
>>> isinstance(c.to_strings(), ak.Strings)
True
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Categorical object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Categorical is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unique() Categorical[source]
unregister() None[source]

Unregister this Categorical object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

static unregister_categorical_by_name(user_defined_name: str) None[source]

Function to unregister Categorical object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the Categorical object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path, dataset='categorical_array', repack=True)[source]

Overwrite the dataset with the name provided with this Categorical object. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Categorical

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, the repack option allows for automatic creation of a file without the inaccessible data.

class arkouda.Complex128DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.Complex64DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.DType[source]

An enumeration.

BIGINT(*args, **kwargs)

An enumeration.

BOOL(*args, **kwargs)

An enumeration.

COMPLEX128(*args, **kwargs)

An enumeration.

COMPLEX64(*args, **kwargs)

An enumeration.

FLOAT(*args, **kwargs)

An enumeration.

FLOAT32(*args, **kwargs)

An enumeration.

FLOAT64(*args, **kwargs)

An enumeration.

INT(*args, **kwargs)

An enumeration.

INT16(*args, **kwargs)

An enumeration.

INT32(*args, **kwargs)

An enumeration.

INT64(*args, **kwargs)

An enumeration.

INT8(*args, **kwargs)

An enumeration.

STR(*args, **kwargs)

An enumeration.

UINT(*args, **kwargs)

An enumeration.

UINT16(*args, **kwargs)

An enumeration.

UINT32(*args, **kwargs)

An enumeration.

UINT64(*args, **kwargs)

An enumeration.

UINT8(*args, **kwargs)

An enumeration.

name(*args, **kwargs)

The name of the Enum member.

value(*args, **kwargs)

The value of the Enum member.

class arkouda.DTypeObjects

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.DTypes

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.DataFrame(dict=None, /, **kwargs)[source]

Bases: collections.UserDict

A DataFrame structure based on arkouda arrays.

Parameters:
  • initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and should be a homogenous type. Different columns may have different types. If using a dictionary, keys should be strings.

  • index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.

  • columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must be strings. Defaults to an stringified integer range.

Examples

Create an empty DataFrame and add a column of data:

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame()
>>> df['a'] = ak.array([1,2,3])
>>> display(df)

a

0

1

1

2

2

3

Create a new DataFrame using a dictionary of data:

>>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame({'userName': userName, 'userID': userID,
>>>            'item': item, 'day': day, 'amount': amount})
>>> display(df)

userName

userID

item

day

amount

0

Alice

111

0

5

0.5

1

Bob

222

0

5

0.6

2

Alice

111

1

6

1.1

3

Carol

333

1

5

1.2

4

Bob

222

2

6

4.3

5

Alice

111

0

6

0.6

Indexing works slightly differently than with pandas:

>>> df[0]

keys

values

userName

Alice

userID

111

item

0

day

5

amount

0.5

>>> df['userID']
array([111, 222, 111, 333, 222, 111])
>>> df['userName']
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> df[ak.array([1,3,5])]

userName

userID

item

day

amount

0

Bob

222

0

5

0.6

1

Carol

333

1

5

1.2

2

Alice

111

0

6

0.6

Compute the stride:

>>> df[1:5:1]

userName

userID

item

day

amount

0

Bob

222

0

5

0.6

1

Alice

111

1

6

1.1

2

Carol

333

1

5

1.2

3

Bob

222

2

6

4.3

>>> df[ak.array([1,2,3])]

userName

userID

item

day

amount

0

Bob

222

0

5

0.6

1

Alice

111

1

6

1.1

2

Carol

333

1

5

1.2

>>> df[['userID', 'day']]

userID

day

0

111

5

1

222

5

2

111

6

3

333

5

4

222

6

5

111

6

GroupBy(keys, use_series=False, as_index=True, dropna=True)[source]

Group the dataframe by a column or a list of columns.

Parameters:
  • keys (str or list of str) – An (ordered) list of column names or a single string to group by.

  • use_series (bool, default=False) – If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.

  • as_index (bool, default=True) – If True, groupby columns will be set as index otherwise, the groupby columns will be treated as DataFrame columns.

  • dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Returns:

If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.

Return type:

arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy

See also

arkouda.GroupBy

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df

col1

col2

0

1

4

1

1

5

2

2

6

3

nan

7

>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1",use_series=True)
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1",use_series=True, as_index = False).size()

col1

size

0

1

2

1

2

1

all(axis=0) Series | bool[source]

Return whether all elements are True, potentially over an axis.

Returns True unless there at least one element along a Dataframe axis that is False.

Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True.

Parameters:

axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

None : reduce all axes, return a scalar.

Return type:

arkouda.series.Series or bool

Raises:

ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...          "C":[True,False,True,False],"D":[True,True,True,True]})

A

B

C

D

0

True

True

True

True

1

True

True

False

True

2

True

True

True

True

3

False

False

False

True

>>> df.all(axis=0)
A    False
B    False
C    False
D     True
dtype: bool
>>> df.all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool
>>> df.all(axis=None)
False
any(axis=0) Series | bool[source]

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element along a Dataframe axis that is True.

Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True.

Parameters:

axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

None : reduce all axes, return a scalar.

Return type:

arkouda.series.Series or bool

Raises:

ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...          "C":[True,False,True,False],"D":[False,False,False,False]})

A

B

C

D

0

True

True

True

False

1

True

True

False

False

2

True

True

True

False

3

False

False

False

False

>>> df.any(axis=0)
A     True
B     True
C     True
D    False
dtype: bool
>>> df.any(axis=1)
0     True
1     True
2     True
3    False
dtype: bool
>>> df.any(axis=None)
True
append(other, ordered=True)[source]

Concatenate data from ‘other’ onto the end of this DataFrame, in place.

Explicitly, use the arkouda concatenate function to append the data from each column in other to the end of self. This operation is done in place, in the sense that the underlying pdarrays are updated from the result of the arkouda concatenate function, rather than returning a new DataFrame object containing the result.

Parameters:
  • other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.

  • ordered (bool, default=True) – If False, allow rows to be interleaved for better performance (but data within a row remains together). By default, append all rows to the end, in input order.

Returns:

Appending occurs in-place, but result is returned for compatibility.

Return type:

self

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df1 = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

col1

col2

0

1

3

1

2

4

>>> df2 = ak.DataFrame({'col1': [3], 'col2': [5]})

col1

col2

0

3

5

>>> df1.append(df2)
>>> df1

col1

col2

0

1

3

1

2

4

2

3

5

apply_permutation(perm)[source]

Apply a permutation to an entire DataFrame. The operation is done in place and the original DataFrame will be modified.

This may be useful if you want to unsort an DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.

Parameters:

perm (pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.

Return type:

None

See also

sort

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

col1

col2

0

1

4

1

2

5

2

3

6

>>> perm_arry = ak.array([0, 2, 1])
>>> df.apply_permutation(perm_arry)
>>> display(df)

col1

col2

0

1

4

1

3

6

2

2

5

argsort(key, ascending=True)[source]

Return the permutation that sorts the dataframe by key.

Parameters:
  • key (str) – The key to sort on.

  • ascending (bool, default = True) – If true, sort the key in ascending order. Otherwise, sort the key in descending order.

Returns:

The permutation array that sorts the data on key.

Return type:

arkouda.pdarrayclass.pdarray

See also

coargsort

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]})
>>> display(df)

col1

col2

0

1.1

6

1

3.1

5

2

2.1

4

>>> df.argsort('col1')
array([0 2 1])
>>> sorted_df1 = df[df.argsort('col1')]
>>> display(sorted_df1)

col1

col2

0

1.1

6

1

2.1

4

2

3.1

5

>>> df.argsort('col2')
array([2 1 0])
>>> sorted_df2 = df[df.argsort('col2')]
>>> display(sorted_df2)

col1

col2

0

2.1

4

1

3.1

5

2

1.1

6

assign(**kwargs) DataFrame[source]

Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters:

**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns:

A new DataFrame with the new columns in addition to all the existing columns.

Return type:

DataFrame

Notes

Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.

Examples

>>> df = ak.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0

Where the value is a callable, evaluated on df:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:

>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:

>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
attach(user_defined_name: str) DataFrame[source]

Function to return a DataFrame object attached to the registered name in the arkouda server which was registered using register().

Parameters:

user_defined_name (str) – user defined name which DataFrame object was registered under.

Returns:

The DataFrame object created by re-attaching to the corresponding server components.

Return type:

arkouda.dataframe.DataFrame

Raises:

RegistrationError – if user_defined_name is not registered

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
coargsort(keys, ascending=True)[source]

Return the permutation that sorts the dataframe by keys.

Note: Sorting using Strings may not yield correct sort order.

Parameters:

keys (list of str) – The keys to sort on.

Returns:

The permutation array that sorts the data on keys.

Return type:

arkouda.pdarrayclass.pdarray

Example

>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> display(df)

col1

col2

col3

0

2

3

5

1

2

4

6

2

1

3

7

>>> df.coargsort(['col1', 'col2'])
array([2 0 1])
>>>
property columns

An Index where the values are the column names of the dataframe.

Returns:

The values of the index are the column names of the dataframe.

Return type:

arkouda.index.Index

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df

col1

col2

0

1

3

1

2

4

>>> df.columns
Index(array(['col1', 'col2']), dtype='<U0')
concat(items, ordered=True)[source]

Essentially an append, but different formatting.

corr() DataFrame[source]

Return new DataFrame with pairwise correlation of columns.

Returns:

Arkouda DataFrame containing correlation matrix of all columns.

Return type:

arkouda.dataframe.DataFrame

Raises:

RuntimeError – Raised if there’s a server-side error thrown.

See also

pdarray.corr

Notes

Generates the correlation matrix using Pearson R for all columns.

Attempts to convert to numeric values where possible for inclusion in the matrix.

Example

>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [-1, -2]})
>>> display(df)

col1

col2

0

1

-1

1

2

-2

>>> corr = df.corr()

col1

col2

col1

1

-1

col2

-1

1

count(axis: int | str = 0, numeric_only=False) Series[source]

Count non-NA cells for each column or row.

The values np.NaN are considered NA.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

  • numeric_only (bool = False) – Include only float, int or boolean data.

Returns:

For each column/row the number of non-NA/null entries.

Return type:

arkouda.series.Series

Raises:

ValueError – Raised if axis is not 0, 1, ‘index’, or ‘columns’.

See also

GroupBy.count

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({'col_A': ak.array([7, np.nan]), 'col_B':ak.array([1, 9])})
>>> display(df)

col_A

col_B

0

7

1

1

nan

9

>>> df.count()
col_A    1
col_B    2
dtype: int64
>>> df = ak.DataFrame({'col_A': ak.array(["a","b","c"]), 'col_B':ak.array([1, np.nan, np.nan])})
>>> display(df)

col_A

col_B

0

a

1

1

b

nan

2

c

nan

>>> df.count()
col_A    3
col_B    1
dtype: int64
>>> df.count(numeric_only=True)
col_B    1
dtype: int64
>>> df.count(axis=1)
0    2
1    1
2    1
dtype: int64
drop(keys: str | int | List[str | int], axis: str | int = 0, inplace: bool = False) None | DataFrame[source]

Drop column/s or row/s from the dataframe.

Parameters:
  • keys (str, int or list) – The labels to be dropped on the given axis.

  • axis (int or str) – The axis on which to drop from. 0/’index’ - drop rows, 1/’columns’ - drop columns.

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DateFrame when inplace=False; None when inplace=True

Return type:

arkouda.dataframe.DataFrame or None

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> display(df)

col1

col2

0

1

3

1

2

4

Drop column

>>> df.drop('col1', axis = 1)

col2

0

3

1

4

Drop row

>>> df.drop(0, axis = 0)

col1

col2

0

2

4

drop_duplicates(subset=None, keep='first')[source]

Drops duplcated rows and returns resulting DataFrame.

If a subset of the columns are provided then only one instance of each duplicated row will be returned (keep determines which row).

Parameters:
  • subset (Iterable) – Iterable of column names to use to dedupe.

  • keep ({'first', 'last'}, default='first') – Determines which duplicates (if any) to keep.

Returns:

DataFrame with duplicates removed.

Return type:

arkouda.dataframe.DataFrame

Example

>>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]})
>>> display(df)

col1

col2

0

1

4

1

2

5

2

2

5

3

3

6

>>> df.drop_duplicates()

col1

col2

0

1

4

1

2

5

2

3

6

dropna(axis: int | str = 0, how: str | None = None, thresh: int | None = None, ignore_index: bool = False) DataFrame[source]

Remove missing values.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default = 0) –

    Determine if rows or columns which contain missing values are removed.

    0, or ‘index’: Drop rows which contain missing values.

    1, or ‘columns’: Drop columns which contain missing value.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default='any') –

    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

    ’any’: If any NA values are present, drop that row or column.

    ’all’: If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non - NA values.Cannot be combined with how.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns:

DataFrame with NA entries dropped from it.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame(
    {
        "A": [True, True, True, True],
        "B": [1, np.nan, 2, np.nan],
        "C": [1, 2, 3, np.nan],
        "D": [False, False, False, False],
        "E": [1, 2, 3, 4],
        "F": ["a", "b", "c", "d"],
        "G": [1, 2, 3, 4],
    }
   )
>>> display(df)

A

B

C

D

E

F

G

0

True

1

1

False

1

a

1

1

True

nan

2

False

2

b

2

2

True

2

3

False

3

c

3

3

True

nan

nan

False

4

d

4

>>> df.dropna()

A

B

C

D

E

F

G

0

True

1

1

False

1

a

1

1

True

2

3

False

3

c

3

>>> df.dropna(axis=1)

A

D

E

F

G

0

True

False

1

a

1

1

True

False

2

b

2

2

True

False

3

c

3

3

True

False

4

d

4

>>> df.dropna(axis=1, thresh=3)

A

C

D

E

F

G

0

True

1

False

1

a

1

1

True

2

False

2

b

2

2

True

3

False

3

c

3

3

True

nan

False

4

d

4

>>> df.dropna(axis=1, how="all")

A

B

C

D

E

F

G

0

True

1

1

False

1

a

1

1

True

nan

2

False

2

b

2

2

True

2

3

False

3

c

3

3

True

nan

nan

False

4

d

4

property dtypes: DataFrame

The dtypes of the dataframe.

Returns:

dtypes – The dtypes of the dataframe.

Return type:

arkouda.row.Row

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df

col1

col2

0

1

a

1

2

b

>>> df.dtypes

keys

values

col1

int64

col2

str

property empty: DataFrame

Whether the dataframe is empty.

Returns:

True if the dataframe is empty, otherwise False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({})
>>> df
 0 rows x 0 columns
>>> df.empty
True
filter_by_range(keys, low=1, high=None)[source]

Find all rows where the value count of the items in a given set of columns (keys) is within the range [low, high].

To filter by a specific value, set low == high.

Parameters:
  • keys (str or list of str) – The names of the columns to group by.

  • low (int, default=1) – The lowest value count.

  • high (int, default=None) – The highest value count, default to unlimited.

Returns:

An array of boolean values for qualified rows in this DataFrame.

Return type:

arkouda.pdarrayclass.pdarray

Example

>>> df = ak.DataFrame({'col1': [1, 2, 2, 2, 3, 3], 'col2': [4, 5, 6, 7, 8, 9]})
>>> display(df)

col1

col2

0

1

4

1

2

5

2

2

6

3

2

7

4

3

8

5

3

9

>>> df.filter_by_range("col1", low=1, high=2)
array([True False False False True True])
>>> filtered_df = df[df.filter_by_range("col1", low=1, high=2)]
>>> display(filtered_df)

col1

col2

0

1

4

1

3

8

2

3

9

from_pandas(pd_df)[source]

Copy the data from a pandas DataFrame into a new arkouda.dataframe.DataFrame.

Parameters:

pd_df (pandas.DataFrame) – A pandas DataFrame to convert.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import pandas as pd
>>> pd_df = pd.DataFrame({"A":[1,2],"B":[3,4]})
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)

A

B

0

1

3

1

2

4

>>> ak_df = DataFrame.from_pandas(pd_df)
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)

A

B

0

1

3

1

2

4

from_return_msg(rep_msg)[source]

Creates a DataFrame object from an arkouda server response message.

Parameters:

rep_msg (string) – Server response message used to create a DataFrame.

Return type:

arkouda.dataframe.DataFrame

groupby(keys, use_series=True, as_index=True, dropna=True)[source]

Group the dataframe by a column or a list of columns. Alias for GroupBy.

Parameters:
  • keys (str or list of str) – An (ordered) list of column names or a single string to group by.

  • use_series (bool, default=True) – If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.

  • as_index (bool, default=True) – If True, groupby columns will be set as index otherwise, the groupby columns will be treated as DataFrame columns.

  • dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Returns:

If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.

Return type:

arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy

See also

arkouda.GroupBy

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df

col1

col2

0

1

4

1

1

5

2

2

6

3

nan

7

>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1",use_series=True)
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1",use_series=True, as_index = False).size()

col1

size

0

1

2

1

2

1

head(n=5)[source]

Return the first n rows.

This function returns the first n rows of the the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.

Parameters:

n (int, default = 5) – Number of rows to select.

Returns:

The first n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

See also

tail

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

5

5

-5

6

6

-6

7

7

-7

8

8

-8

9

9

-9

>>> df.head()

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

>>> df.head(n=2)

col1

col2

0

0

0

1

1

-1

property index

The index of the dataframe.

Returns:

The index of the dataframe.

Return type:

arkouda.index.Index or arkouda.index.MultiIndex

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df

col1

col2

0

1

3

1

2

4

>>> df.index
Index(array([0 1]), dtype='int64')
property info

Returns a summary string of this dataframe.

Returns:

A summary string of this dataframe.

Return type:

str

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df

col1

col2

0

1

a

1

2

b

>>> df.info
"DataFrame(['col1', 'col2'], 2 rows, 20 B)"
is_registered() bool[source]

Return True if the object is contained in the registry.

Returns:

Indicates if the object is contained in the registry.

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
isin(values: pdarray | Dict | Series | DataFrame) DataFrame[source]

Determine whether each element in the DataFrame is contained in values.

Parameters:

values (pdarray, dict, Series, or DataFrame) – The values to check for in DataFrame. Series can only have a single index.

Returns:

Arkouda DataFrame of booleans showing whether each element in the DataFrame is contained in values.

Return type:

arkouda.dataframe.DataFrame

See also

ak.Series.isin

Notes

  • Pandas supports values being an iterable type. In arkouda, we replace this with pdarray.

  • Pandas supports ~ operations. Currently, ak.DataFrame does not support this.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_B':ak.array([1, 9])})
>>> display(df)

col_A

col_B

0

7

1

1

3

9

When values is a pdarray, check every value in the DataFrame to determine if it exists in values.

>>> df.isin(ak.array([0, 1]))

col_A

col_B

0

0

1

1

0

0

When values is a dict, the values in the dict are passed to check the column indicated by the key.

>>> df.isin({'col_A': ak.array([0, 3])})

col_A

col_B

0

0

0

1

1

0

When values is a Series, each column is checked if values is present positionally. This means that for True to be returned, the indexes must be the same.

>>> i = ak.Index(ak.arange(2))
>>> s = ak.Series(data=[3, 9], index=i)
>>> df.isin(s)

col_A

col_B

0

0

0

1

0

1

When values is a DataFrame, the index and column must match. Note that 9 is not found because the column name does not match.

>>> other_df = ak.DataFrame({'col_A':ak.array([7, 3]), 'col_C':ak.array([0, 9])})
>>> df.isin(other_df)

col_A

col_B

0

1

0

1

1

0

isna() DataFrame[source]

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. numpy.NaN values get mapped to True values. Everything else gets mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...          "C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]})
>>> display(df)

A

B

C

D

0

nan

3

1

a

1

2

nan

nan

b

2

2

5

2

c

3

3

6

nan

d

>>> df.isna()
       A      B      C      D
0   True  False  False  False
1  False   True   True  False
2  False  False  False  False
3  False  False   True  False (4 rows x 4 columns)
load(prefix_path, file_format='INFER')[source]

Load dataframe from file. file_format needed for consistency with other load functions.

Parameters:
  • prefix_path (str) – The prefix path for the data.

  • file_format (string, default = "INFER")

Returns:

A dataframe loaded from the prefix_path.

Return type:

arkouda.dataframe.DataFrame

Examples

To store data in <my_dir>/my_data_LOCALE0000, use “<my_dir>/my_data” as the prefix.

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf5_output','my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.save(my_path, file_type="distribute")
>>> df.load(my_path)

A

B

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

memory_usage(index=True, unit='B') Series[source]

Return the memory usage of each column in bytes.

The memory usage can optionally include the contribution of the index.

Parameters:
  • index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If index=True, the memory usage of the index is the first item in the output.

  • unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Returns:

A Series whose index is the original column names and whose values is the memory usage of each column in bytes.

Return type:

Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> dtypes = [ak.int64, ak.float64,  ak.bool]
>>> data = dict([(str(t), ak.ones(5000, dtype=ak.int64).astype(t)) for t in dtypes])
>>> df = ak.DataFrame(data)
>>> display(df.head())

int64

float64

bool

0

1

1

True

1

1

1

True

2

1

1

True

3

1

1

True

4

1

1

True

>>> df.memory_usage()

0

Index

40000

int64

40000

float64

40000

bool

5000

>>> df.memory_usage(index=False)

0

int64

40000

float64

40000

bool

5000

>>> df.memory_usage(unit="KB")

0

Index

39.0625

int64

39.0625

float64

39.0625

bool

4.88281

To get the approximate total memory usage:

>>>  df.memory_usage(index=True).sum()
memory_usage_info(unit='GB')[source]

A formatted string representation of the size of this DataFrame.

Parameters:

unit (str, default = "GB") – Unit to return. One of {‘KB’, ‘MB’, ‘GB’}.

Returns:

A string representation of the number of bytes used by this DataFrame in [unit]s.

Return type:

str

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(1000), 'col2': ak.arange(1000)})
>>> df.memory_usage_info()
'0.00 GB'
>>> df.memory_usage_info(unit="KB")
'15 KB'
merge(right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]

Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).

Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Parameters:
  • right (DataFrame) – The Right DataFrame to be joined.

  • on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.

  • how ({"inner", "left", "right}, default = "inner") – The merge condition. Must be “inner”, “left”, or “right”.

  • left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.

  • right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.

  • convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.

  • sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.

Returns:

Joined Arkouda DataFrame.

Return type:

arkouda.dataframe.DataFrame

Note

Multiple column joins are only supported for integer columns.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> display(left_df)

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> display(right_df)

col1

col2

0

0

0

1

2

2

2

4

4

3

6

6

4

8

8

>>> left_df.merge(right_df, on = "col1")

col1

col2_x

col2_y

0

0

0

0

1

2

-2

2

2

4

-4

4

>>> left_df.merge(right_df, on = "col1", how = "left")

col1

col2_y

col2_x

0

0

0

0

1

1

nan

-1

2

2

2

-2

3

3

nan

-3

4

4

4

-4

>>> left_df.merge(right_df, on = "col1", how = "right")

col1

col2_x

col2_y

0

0

0

0

1

2

-2

2

2

4

-4

4

3

6

nan

6

4

8

nan

8

>>> left_df.merge(right_df, on = "col1", how = "outer")

col1

col2_y

col2_x

0

0

0

0

1

1

nan

-1

2

2

2

-2

3

3

nan

-3

4

4

4

-4

5

6

6

nan

6

8

8

nan

notna() DataFrame[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. numpy.NaN values get mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...          "C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]})
>>> display(df)

A

B

C

D

0

nan

3

1

a

1

2

nan

nan

b

2

2

5

2

c

3

3

6

nan

d

>>> df.notna()
       A      B      C     D
0  False   True   True  True
1   True  False  False  True
2   True   True   True  True
3   True   True  False  True (4 rows x 4 columns)
objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

read_csv(filename: str, col_delim: str = ',')[source]

Read the columns of a CSV file into an Arkouda DataFrame. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings objects.

Parameters:
  • filename (str) – Filename to read data from.

  • col_delim (str, default=",") – The delimiter for columns within the data.

Returns:

Arkouda DataFrame containing the columns from the CSV file.

Return type:

arkouda.dataframe.DataFrame

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server.

See also

to_csv

Notes

  • CSV format is not currently supported by load/load_all operations.

  • The column delimiter is expected to be the same for column names and data.

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (”\n”) at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing

bytes as uint(8).

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output','my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_csv(my_path)
>>> df2 = DataFrame.read_csv(my_path + "_LOCALE0000")
>>> display(df2)

A

B

0

1

3

1

2

4

register(user_defined_name: str) DataFrame[source]

Register this DataFrame object and underlying components with the Arkouda server.

Parameters:

user_defined_name (str) – User defined name the DataFrame is to be registered under. This will be the root name for underlying components.

Returns:

The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name.

Return type:

arkouda.dataframe.DataFrame

Raises:
  • TypeError – Raised if user_defined_name is not a str.

  • RegistrationError – If the server was unable to register the DataFrame with the user_defined_name.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
rename(mapper: Callable | Dict | None = None, index: Callable | Dict | None = None, column: Callable | Dict | None = None, axis: str | int = 0, inplace: bool = False) DataFrame | None[source]

Rename indexes or columns according to a mapping.

Parameters:
  • mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. Uses the value of axis to determine if renaming column or index

  • column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.

  • index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored.

  • axis (int or str, default=0) – Indicates which axis to perform the rename. 0/”index” - Indexes 1/”column” - Columns

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DateFrame when inplace=False; None when inplace=True.

Return type:

arkouda.dataframe.DataFrame or None

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)

A

B

0

1

4

1

2

5

2

3

6

Rename columns using a mapping:

>>> df.rename(column={'A':'a', 'B':'c'})

a

c

0

1

4

1

2

5

2

3

6

Rename indexes using a mapping:

>>> df.rename(index={0:99, 2:11})

A

B

0

1

4

1

2

5

2

3

6

Rename using an axis style parameter:

>>> df.rename(str.lower, axis='column')

a

b

0

1

4

1

2

5

2

3

6

reset_index(size: int | None = None, inplace: bool = False) None | DataFrame[source]

Set the index to an integer range.

Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.

Parameters:
  • size (int, optional) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DateFrame when inplace=False; None when inplace=True.

Return type:

arkouda.dataframe.DataFrame or None

Note

Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.

Example

>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)

A

B

0

1

4

1

2

5

2

3

6

>>> perm_df = df[ak.array([0,2,1])]
>>> display(perm_df)

A

B

0

1

4

1

3

6

2

2

5

>>> perm_df.reset_index()

A

B

0

1

4

1

3

6

2

2

5

sample(n=5)[source]

Return a random sample of n rows.

Parameters:

n (int, default=5) – Number of rows to return.

Returns:

The sampled n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

Example

>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> display(df)

A

B

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

Random output of size 3:

>>> df.sample(n=3)

A

B

0

0

0

1

1

-1

2

4

-4

save(path, index=False, columns=None, file_format='HDF5', file_type='distribute', compression: str | None = None)[source]

DEPRECATED Save DataFrame to disk, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (list, default=None) – List of columns to include in the file. If None, writes out all columns.

  • file_format (str, default='HDF5') – ‘HDF5’ or ‘Parquet’. Defaults to ‘HDF5’

  • file_type (str, default=distribute) – “single” or “distribute” If single, will right a single file to locale 0.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Compression type. Only used for Parquet

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_parquet, to_hdf

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf5_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.save(my_path + '/my_data', file_type="single")
>>> df.load(my_path + '/my_data')

A

B

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

property shape

The shape of the dataframe.

Returns:

Tuple of array dimensions.

Return type:

tuple of int

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df

col1

col2

0

1

4

1

2

5

2

3

6

>>> df.shape
(3, 2)
property size

Returns the number of bytes on the arkouda server.

Returns:

The number of bytes on the arkouda server.

Return type:

int

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df

col1

col2

0

1

4

1

2

5

2

3

6

>>> df.size
6
sort_index(ascending=True)[source]

Sort the DataFrame by indexed columns.

Note: Fails on sort order of arkouda.strings.Strings columns when multiple columns being sorted.

Parameters:

ascending (bool, default = True) – Sort values in ascending (default) or descending order.

Example

>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]},
...          index = Index(ak.array([2,0,1]), name="idx"))
>>> display(df)

idx

col1

col2

0

1.1

6

1

3.1

5

2

2.1

4

>>> df.sort_index()

idx

col1

col2

0

3.1

5

1

2.1

4

2

1.1

6

sort_values(by=None, ascending=True)[source]

Sort the DataFrame by one or more columns.

If no column is specified, all columns are used.

Note: Fails on order of arkouda.strings.Strings columns when multiple columns being sorted.

Parameters:
  • by (str or list/tuple of str, default = None) – The name(s) of the column(s) to sort by.

  • ascending (bool, default = True) – Sort values in ascending (default) or descending order.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> display(df)

col1

col2

col3

0

2

3

5

1

2

4

6

2

1

3

7

>>> df.sort_values()

col1

col2

col3

0

1

3

7

1

2

3

5

2

2

4

6

>>> df.sort_values("col3")

col1

col2

col3

0

1

3

7

1

2

3

5

2

2

4

6

tail(n=5)[source]

Return the last n rows.

This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.

Parameters:

n (int, default=5) – Number of rows to select.

Returns:

The last n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

See also

arkouda.dataframe.head

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

5

5

-5

6

6

-6

7

7

-7

8

8

-8

9

9

-9

>>> df.tail()

col1

col2

0

5

-5

1

6

-6

2

7

-7

3

8

-8

4

9

-9

>>> df.tail(n=2)

col1

col2

0

8

-8

1

9

-9

to_csv(path: str, index: bool = False, columns: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]

Writes DataFrame to CSV file(s). File will contain a column for each column in the DataFrame. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • index (bool, default=False) – If True, the index of the DataFrame will be written to the file as a column.

  • columns (list of str (Optional)) – Column names to assign when writing data.

  • col_delim (str, default=",") – Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool, default=False) – If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

None

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server.

Notes

  • CSV format is not currently supported by load/load_all operations.

  • The column delimiter is expected to be the same for column names and data.

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (”\n”) at this time.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_csv(my_path + "/my_data")
>>> df2 = DataFrame.read_csv(my_path + "/my_data" + "_LOCALE0000")
>>> display(df2)

A

B

0

1

3

1

2

4

to_hdf(path, index=False, columns=None, file_type='distribute')[source]

Save DataFrame to disk as hdf5, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (List, default = None) – List of columns to include in the file. If None, writes out all columns.

  • file_type (str (single | distribute), default=distribute) – Whether to save to a single file or distribute across Locales.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_parquet, load

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

A

B

0

1

3

1

2

4

to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)[source]

Print DataFrame in Markdown-friendly format.

Parameters:
  • mode (str, optional) – Mode in which file is opened, “wt” by default.

  • index (bool, optional, default True) – Add index (row) labels.

  • tablefmt (str = "grid") – Table format to call from tablulate: https://pypi.org/project/tabulate/

  • storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set of allowed keys and values.

  • **kwargs – These parameters will be passed to tabulate.

Note

This function should only be called on small DataFrames as it calls pandas.DataFrame.to_markdown: https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.to_markdown.html

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
>>> print(df.to_markdown())
+----+------------+------------+
|    | animal_1   | animal_2   |
+====+============+============+
|  0 | elk        | dog        |
+----+------------+------------+
|  1 | pig        | quetzal    |
+----+------------+------------+

Suppress the index:

>>> print(df.to_markdown(index = False))
+------------+------------+
| animal_1   | animal_2   |
+============+============+
| elk        | dog        |
+------------+------------+
| pig        | quetzal    |
+------------+------------+
to_pandas(datalimit=1073741824, retain_index=False)[source]

Send this DataFrame to a pandas DataFrame.

Parameters:
  • datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum number size, in megabytes to transfer. The requested DataFrame will be converted to a pandas DataFrame only if the estimated size of the DataFrame does not exceed this value.

  • retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.

Returns:

The result of converting this DataFrame to a pandas DataFrame.

Return type:

pandas.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)})
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)

A

B

0

0

0

1

1

-1

>>> import pandas as pd
>>> pd_df = ak_df.to_pandas()
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)

A

B

0

0

0

1

1

-1

to_parquet(path, index=False, columns=None, compression: str | None = None, convert_categoricals: bool = False)[source]

Save DataFrame to disk as parquet, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (list) – List of columns to include in the file. If None, writes out all columns.

  • compression (str (Optional), default=None) – Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4

  • convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_hdf, load

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'parquet_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_parquet(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

B

A

0

3

1

1

4

2

transfer(hostname, port)[source]

Sends a DataFrame to a different Arkouda server.

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the DataFrame is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Returns:

A message indicating a complete transfer.

Return type:

str

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister()[source]

Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach().

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
unregister_dataframe_by_name(user_defined_name: str) str[source]

Function to unregister DataFrame object by name which was registered with the arkouda server via register().

Parameters:

user_defined_name (str) – Name under which the DataFrame object was registered.

Raises:
  • TypeError – If user_defined_name is not a string.

  • RegistrationError – If there is an issue attempting to unregister any underlying components.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister_dataframe_by_name("my_table_name")
>>> df.is_registered()
False
update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True)[source]

Overwrite the dataset with the name provided with this dataframe. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.

  • repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Returns:

Success message if successful.

Return type:

str

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

If file does not contain File_Format attribute to indicate how it was saved,

the file name is checked for _LOCALE#### to determine if it is distributed.

If the dataset provided does not exist, it will be added.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

A

B

0

1

3

1

2

4

>>> df2 = ak.DataFrame({"A":[5,6],"B":[7,8]})
>>> df2.update_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

A

B

0

5

7

1

6

8

update_nrows()[source]

Computes the number of rows on the arkouda server and updates the size parameter.

class arkouda.DataFrame(dict=None, /, **kwargs)[source]

Bases: collections.UserDict

A DataFrame structure based on arkouda arrays.

Parameters:
  • initialdata (List or dictionary of lists, tuples, or pdarrays) – Each list/dictionary entry corresponds to one column of the data and should be a homogenous type. Different columns may have different types. If using a dictionary, keys should be strings.

  • index (Index, pdarray, or Strings) – Index for the resulting frame. Defaults to an integer range.

  • columns (List, tuple, pdarray, or Strings) – Column labels to use if the data does not include them. Elements must be strings. Defaults to an stringified integer range.

Examples

Create an empty DataFrame and add a column of data:

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame()
>>> df['a'] = ak.array([1,2,3])
>>> display(df)

a

0

1

1

2

2

3

Create a new DataFrame using a dictionary of data:

>>> userName = ak.array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> userID = ak.array([111, 222, 111, 333, 222, 111])
>>> item = ak.array([0, 0, 1, 1, 2, 0])
>>> day = ak.array([5, 5, 6, 5, 6, 6])
>>> amount = ak.array([0.5, 0.6, 1.1, 1.2, 4.3, 0.6])
>>> df = ak.DataFrame({'userName': userName, 'userID': userID,
>>>            'item': item, 'day': day, 'amount': amount})
>>> display(df)

userName

userID

item

day

amount

0

Alice

111

0

5

0.5

1

Bob

222

0

5

0.6

2

Alice

111

1

6

1.1

3

Carol

333

1

5

1.2

4

Bob

222

2

6

4.3

5

Alice

111

0

6

0.6

Indexing works slightly differently than with pandas:

>>> df[0]

keys

values

userName

Alice

userID

111

item

0

day

5

amount

0.5

>>> df['userID']
array([111, 222, 111, 333, 222, 111])
>>> df['userName']
array(['Alice', 'Bob', 'Alice', 'Carol', 'Bob', 'Alice'])
>>> df[ak.array([1,3,5])]

userName

userID

item

day

amount

0

Bob

222

0

5

0.6

1

Carol

333

1

5

1.2

2

Alice

111

0

6

0.6

Compute the stride:

>>> df[1:5:1]

userName

userID

item

day

amount

0

Bob

222

0

5

0.6

1

Alice

111

1

6

1.1

2

Carol

333

1

5

1.2

3

Bob

222

2

6

4.3

>>> df[ak.array([1,2,3])]

userName

userID

item

day

amount

0

Bob

222

0

5

0.6

1

Alice

111

1

6

1.1

2

Carol

333

1

5

1.2

>>> df[['userID', 'day']]

userID

day

0

111

5

1

222

5

2

111

6

3

333

5

4

222

6

5

111

6

GroupBy(keys, use_series=False, as_index=True, dropna=True)[source]

Group the dataframe by a column or a list of columns.

Parameters:
  • keys (str or list of str) – An (ordered) list of column names or a single string to group by.

  • use_series (bool, default=False) – If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.

  • as_index (bool, default=True) – If True, groupby columns will be set as index otherwise, the groupby columns will be treated as DataFrame columns.

  • dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Returns:

If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.

Return type:

arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy

See also

arkouda.GroupBy

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df

col1

col2

0

1

4

1

1

5

2

2

6

3

nan

7

>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1",use_series=True)
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1",use_series=True, as_index = False).size()

col1

size

0

1

2

1

2

1

all(axis=0) Series | bool[source]

Return whether all elements are True, potentially over an axis.

Returns True unless there at least one element along a Dataframe axis that is False.

Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True.

Parameters:

axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

None : reduce all axes, return a scalar.

Return type:

arkouda.series.Series or bool

Raises:

ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...          "C":[True,False,True,False],"D":[True,True,True,True]})

A

B

C

D

0

True

True

True

True

1

True

True

False

True

2

True

True

True

True

3

False

False

False

True

>>> df.all(axis=0)
A    False
B    False
C    False
D     True
dtype: bool
>>> df.all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool
>>> df.all(axis=None)
False
any(axis=0) Series | bool[source]

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element along a Dataframe axis that is True.

Currently, will ignore any columns that are not type bool. This is equivalent to the pandas option bool_only=True.

Parameters:

axis ({0 or ‘index’, 1 or ‘columns’, None}, default = 0) –

Indicate which axis or axes should be reduced.

0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

None : reduce all axes, return a scalar.

Return type:

arkouda.series.Series or bool

Raises:

ValueError – Raised if axis does not have a value in {0 or ‘index’, 1 or ‘columns’, None}.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[True,True,True,False],"B":[True,True,True,False],
...          "C":[True,False,True,False],"D":[False,False,False,False]})

A

B

C

D

0

True

True

True

False

1

True

True

False

False

2

True

True

True

False

3

False

False

False

False

>>> df.any(axis=0)
A     True
B     True
C     True
D    False
dtype: bool
>>> df.any(axis=1)
0     True
1     True
2     True
3    False
dtype: bool
>>> df.any(axis=None)
True
append(other, ordered=True)[source]

Concatenate data from ‘other’ onto the end of this DataFrame, in place.

Explicitly, use the arkouda concatenate function to append the data from each column in other to the end of self. This operation is done in place, in the sense that the underlying pdarrays are updated from the result of the arkouda concatenate function, rather than returning a new DataFrame object containing the result.

Parameters:
  • other (DataFrame) – The DataFrame object whose data will be appended to this DataFrame.

  • ordered (bool, default=True) – If False, allow rows to be interleaved for better performance (but data within a row remains together). By default, append all rows to the end, in input order.

Returns:

Appending occurs in-place, but result is returned for compatibility.

Return type:

self

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df1 = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

col1

col2

0

1

3

1

2

4

>>> df2 = ak.DataFrame({'col1': [3], 'col2': [5]})

col1

col2

0

3

5

>>> df1.append(df2)
>>> df1

col1

col2

0

1

3

1

2

4

2

3

5

apply_permutation(perm)[source]

Apply a permutation to an entire DataFrame. The operation is done in place and the original DataFrame will be modified.

This may be useful if you want to unsort an DataFrame, or even to apply an arbitrary permutation such as the inverse of a sorting permutation.

Parameters:

perm (pdarray) – A permutation array. Should be the same size as the data arrays, and should consist of the integers [0,size-1] in some order. Very minimal testing is done to ensure this is a permutation.

Return type:

None

See also

sort

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

col1

col2

0

1

4

1

2

5

2

3

6

>>> perm_arry = ak.array([0, 2, 1])
>>> df.apply_permutation(perm_arry)
>>> display(df)

col1

col2

0

1

4

1

3

6

2

2

5

argsort(key, ascending=True)[source]

Return the permutation that sorts the dataframe by key.

Parameters:
  • key (str) – The key to sort on.

  • ascending (bool, default = True) – If true, sort the key in ascending order. Otherwise, sort the key in descending order.

Returns:

The permutation array that sorts the data on key.

Return type:

arkouda.pdarrayclass.pdarray

See also

coargsort

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]})
>>> display(df)

col1

col2

0

1.1

6

1

3.1

5

2

2.1

4

>>> df.argsort('col1')
array([0 2 1])
>>> sorted_df1 = df[df.argsort('col1')]
>>> display(sorted_df1)

col1

col2

0

1.1

6

1

2.1

4

2

3.1

5

>>> df.argsort('col2')
array([2 1 0])
>>> sorted_df2 = df[df.argsort('col2')]
>>> display(sorted_df2)

col1

col2

0

2.1

4

1

3.1

5

2

1.1

6

assign(**kwargs) DataFrame[source]

Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters:

**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns:

A new DataFrame with the new columns in addition to all the existing columns.

Return type:

DataFrame

Notes

Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.

Examples

>>> df = ak.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0

Where the value is a callable, evaluated on df:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:

>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:

>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
attach(user_defined_name: str) DataFrame[source]

Function to return a DataFrame object attached to the registered name in the arkouda server which was registered using register().

Parameters:

user_defined_name (str) – user defined name which DataFrame object was registered under.

Returns:

The DataFrame object created by re-attaching to the corresponding server components.

Return type:

arkouda.dataframe.DataFrame

Raises:

RegistrationError – if user_defined_name is not registered

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
coargsort(keys, ascending=True)[source]

Return the permutation that sorts the dataframe by keys.

Note: Sorting using Strings may not yield correct sort order.

Parameters:

keys (list of str) – The keys to sort on.

Returns:

The permutation array that sorts the data on keys.

Return type:

arkouda.pdarrayclass.pdarray

Example

>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> display(df)

col1

col2

col3

0

2

3

5

1

2

4

6

2

1

3

7

>>> df.coargsort(['col1', 'col2'])
array([2 0 1])
>>>
property columns

An Index where the values are the column names of the dataframe.

Returns:

The values of the index are the column names of the dataframe.

Return type:

arkouda.index.Index

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df

col1

col2

0

1

3

1

2

4

>>> df.columns
Index(array(['col1', 'col2']), dtype='<U0')
concat(items, ordered=True)[source]

Essentially an append, but different formatting.

corr() DataFrame[source]

Return new DataFrame with pairwise correlation of columns.

Returns:

Arkouda DataFrame containing correlation matrix of all columns.

Return type:

arkouda.dataframe.DataFrame

Raises:

RuntimeError – Raised if there’s a server-side error thrown.

See also

pdarray.corr

Notes

Generates the correlation matrix using Pearson R for all columns.

Attempts to convert to numeric values where possible for inclusion in the matrix.

Example

>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [-1, -2]})
>>> display(df)

col1

col2

0

1

-1

1

2

-2

>>> corr = df.corr()

col1

col2

col1

1

-1

col2

-1

1

count(axis: int | str = 0, numeric_only=False) Series[source]

Count non-NA cells for each column or row.

The values np.NaN are considered NA.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

  • numeric_only (bool = False) – Include only float, int or boolean data.

Returns:

For each column/row the number of non-NA/null entries.

Return type:

arkouda.series.Series

Raises:

ValueError – Raised if axis is not 0, 1, ‘index’, or ‘columns’.

See also

GroupBy.count

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({'col_A': ak.array([7, np.nan]), 'col_B':ak.array([1, 9])})
>>> display(df)

col_A

col_B

0

7

1

1

nan

9

>>> df.count()
col_A    1
col_B    2
dtype: int64
>>> df = ak.DataFrame({'col_A': ak.array(["a","b","c"]), 'col_B':ak.array([1, np.nan, np.nan])})
>>> display(df)

col_A

col_B

0

a

1

1

b

nan

2

c

nan

>>> df.count()
col_A    3
col_B    1
dtype: int64
>>> df.count(numeric_only=True)
col_B    1
dtype: int64
>>> df.count(axis=1)
0    2
1    1
2    1
dtype: int64
drop(keys: str | int | List[str | int], axis: str | int = 0, inplace: bool = False) None | DataFrame[source]

Drop column/s or row/s from the dataframe.

Parameters:
  • keys (str, int or list) – The labels to be dropped on the given axis.

  • axis (int or str) – The axis on which to drop from. 0/’index’ - drop rows, 1/’columns’ - drop columns.

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DateFrame when inplace=False; None when inplace=True

Return type:

arkouda.dataframe.DataFrame or None

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> display(df)

col1

col2

0

1

3

1

2

4

Drop column

>>> df.drop('col1', axis = 1)

col2

0

3

1

4

Drop row

>>> df.drop(0, axis = 0)

col1

col2

0

2

4

drop_duplicates(subset=None, keep='first')[source]

Drops duplcated rows and returns resulting DataFrame.

If a subset of the columns are provided then only one instance of each duplicated row will be returned (keep determines which row).

Parameters:
  • subset (Iterable) – Iterable of column names to use to dedupe.

  • keep ({'first', 'last'}, default='first') – Determines which duplicates (if any) to keep.

Returns:

DataFrame with duplicates removed.

Return type:

arkouda.dataframe.DataFrame

Example

>>> df = ak.DataFrame({'col1': [1, 2, 2, 3], 'col2': [4, 5, 5, 6]})
>>> display(df)

col1

col2

0

1

4

1

2

5

2

2

5

3

3

6

>>> df.drop_duplicates()

col1

col2

0

1

4

1

2

5

2

3

6

dropna(axis: int | str = 0, how: str | None = None, thresh: int | None = None, ignore_index: bool = False) DataFrame[source]

Remove missing values.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default = 0) –

    Determine if rows or columns which contain missing values are removed.

    0, or ‘index’: Drop rows which contain missing values.

    1, or ‘columns’: Drop columns which contain missing value.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default='any') –

    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

    ’any’: If any NA values are present, drop that row or column.

    ’all’: If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non - NA values.Cannot be combined with how.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns:

DataFrame with NA entries dropped from it.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame(
    {
        "A": [True, True, True, True],
        "B": [1, np.nan, 2, np.nan],
        "C": [1, 2, 3, np.nan],
        "D": [False, False, False, False],
        "E": [1, 2, 3, 4],
        "F": ["a", "b", "c", "d"],
        "G": [1, 2, 3, 4],
    }
   )
>>> display(df)

A

B

C

D

E

F

G

0

True

1

1

False

1

a

1

1

True

nan

2

False

2

b

2

2

True

2

3

False

3

c

3

3

True

nan

nan

False

4

d

4

>>> df.dropna()

A

B

C

D

E

F

G

0

True

1

1

False

1

a

1

1

True

2

3

False

3

c

3

>>> df.dropna(axis=1)

A

D

E

F

G

0

True

False

1

a

1

1

True

False

2

b

2

2

True

False

3

c

3

3

True

False

4

d

4

>>> df.dropna(axis=1, thresh=3)

A

C

D

E

F

G

0

True

1

False

1

a

1

1

True

2

False

2

b

2

2

True

3

False

3

c

3

3

True

nan

False

4

d

4

>>> df.dropna(axis=1, how="all")

A

B

C

D

E

F

G

0

True

1

1

False

1

a

1

1

True

nan

2

False

2

b

2

2

True

2

3

False

3

c

3

3

True

nan

nan

False

4

d

4

property dtypes: DataFrame

The dtypes of the dataframe.

Returns:

dtypes – The dtypes of the dataframe.

Return type:

arkouda.row.Row

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df

col1

col2

0

1

a

1

2

b

>>> df.dtypes

keys

values

col1

int64

col2

str

property empty: DataFrame

Whether the dataframe is empty.

Returns:

True if the dataframe is empty, otherwise False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({})
>>> df
 0 rows x 0 columns
>>> df.empty
True
filter_by_range(keys, low=1, high=None)[source]

Find all rows where the value count of the items in a given set of columns (keys) is within the range [low, high].

To filter by a specific value, set low == high.

Parameters:
  • keys (str or list of str) – The names of the columns to group by.

  • low (int, default=1) – The lowest value count.

  • high (int, default=None) – The highest value count, default to unlimited.

Returns:

An array of boolean values for qualified rows in this DataFrame.

Return type:

arkouda.pdarrayclass.pdarray

Example

>>> df = ak.DataFrame({'col1': [1, 2, 2, 2, 3, 3], 'col2': [4, 5, 6, 7, 8, 9]})
>>> display(df)

col1

col2

0

1

4

1

2

5

2

2

6

3

2

7

4

3

8

5

3

9

>>> df.filter_by_range("col1", low=1, high=2)
array([True False False False True True])
>>> filtered_df = df[df.filter_by_range("col1", low=1, high=2)]
>>> display(filtered_df)

col1

col2

0

1

4

1

3

8

2

3

9

from_pandas(pd_df)[source]

Copy the data from a pandas DataFrame into a new arkouda.dataframe.DataFrame.

Parameters:

pd_df (pandas.DataFrame) – A pandas DataFrame to convert.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import pandas as pd
>>> pd_df = pd.DataFrame({"A":[1,2],"B":[3,4]})
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)

A

B

0

1

3

1

2

4

>>> ak_df = DataFrame.from_pandas(pd_df)
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)

A

B

0

1

3

1

2

4

from_return_msg(rep_msg)[source]

Creates a DataFrame object from an arkouda server response message.

Parameters:

rep_msg (string) – Server response message used to create a DataFrame.

Return type:

arkouda.dataframe.DataFrame

groupby(keys, use_series=True, as_index=True, dropna=True)[source]

Group the dataframe by a column or a list of columns. Alias for GroupBy.

Parameters:
  • keys (str or list of str) – An (ordered) list of column names or a single string to group by.

  • use_series (bool, default=True) – If True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise an arkouda.groupbyclass.GroupBy object.

  • as_index (bool, default=True) – If True, groupby columns will be set as index otherwise, the groupby columns will be treated as DataFrame columns.

  • dropna (bool, default=True) – If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Returns:

If use_series = True, returns an arkouda.dataframe.DataFrameGroupBy object. Otherwise returns an arkouda.groupbyclass.GroupBy object.

Return type:

arkouda.dataframe.DataFrameGroupBy or arkouda.groupbyclass.GroupBy

See also

arkouda.GroupBy

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1.0, 1.0, 2.0, np.nan], 'col2': [4, 5, 6, 7]})
>>> df

col1

col2

0

1

4

1

1

5

2

2

6

3

nan

7

>>> df.GroupBy("col1")
<arkouda.groupbyclass.GroupBy at 0x7f2cf23e10c0>
>>> df.GroupBy("col1").size()
(array([1.00000000000000000 2.00000000000000000]), array([2 1]))
>>> df.GroupBy("col1",use_series=True)
col1
1.0    2
2.0    1
dtype: int64
>>> df.GroupBy("col1",use_series=True, as_index = False).size()

col1

size

0

1

2

1

2

1

head(n=5)[source]

Return the first n rows.

This function returns the first n rows of the the dataframe. It is useful for quickly verifying data, for example, after sorting or appending rows.

Parameters:

n (int, default = 5) – Number of rows to select.

Returns:

The first n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

See also

tail

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

5

5

-5

6

6

-6

7

7

-7

8

8

-8

9

9

-9

>>> df.head()

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

>>> df.head(n=2)

col1

col2

0

0

0

1

1

-1

property index

The index of the dataframe.

Returns:

The index of the dataframe.

Return type:

arkouda.index.Index or arkouda.index.MultiIndex

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df

col1

col2

0

1

3

1

2

4

>>> df.index
Index(array([0 1]), dtype='int64')
property info

Returns a summary string of this dataframe.

Returns:

A summary string of this dataframe.

Return type:

str

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2], 'col2': ["a", "b"]})
>>> df

col1

col2

0

1

a

1

2

b

>>> df.info
"DataFrame(['col1', 'col2'], 2 rows, 20 B)"
is_registered() bool[source]

Return True if the object is contained in the registry.

Returns:

Indicates if the object is contained in the registry.

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
isin(values: pdarray | Dict | Series | DataFrame) DataFrame[source]

Determine whether each element in the DataFrame is contained in values.

Parameters:

values (pdarray, dict, Series, or DataFrame) – The values to check for in DataFrame. Series can only have a single index.

Returns:

Arkouda DataFrame of booleans showing whether each element in the DataFrame is contained in values.

Return type:

arkouda.dataframe.DataFrame

See also

ak.Series.isin

Notes

  • Pandas supports values being an iterable type. In arkouda, we replace this with pdarray.

  • Pandas supports ~ operations. Currently, ak.DataFrame does not support this.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col_A': ak.array([7, 3]), 'col_B':ak.array([1, 9])})
>>> display(df)

col_A

col_B

0

7

1

1

3

9

When values is a pdarray, check every value in the DataFrame to determine if it exists in values.

>>> df.isin(ak.array([0, 1]))

col_A

col_B

0

0

1

1

0

0

When values is a dict, the values in the dict are passed to check the column indicated by the key.

>>> df.isin({'col_A': ak.array([0, 3])})

col_A

col_B

0

0

0

1

1

0

When values is a Series, each column is checked if values is present positionally. This means that for True to be returned, the indexes must be the same.

>>> i = ak.Index(ak.arange(2))
>>> s = ak.Series(data=[3, 9], index=i)
>>> df.isin(s)

col_A

col_B

0

0

0

1

0

1

When values is a DataFrame, the index and column must match. Note that 9 is not found because the column name does not match.

>>> other_df = ak.DataFrame({'col_A':ak.array([7, 3]), 'col_C':ak.array([0, 9])})
>>> df.isin(other_df)

col_A

col_B

0

1

0

1

1

0

isna() DataFrame[source]

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. numpy.NaN values get mapped to True values. Everything else gets mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...          "C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]})
>>> display(df)

A

B

C

D

0

nan

3

1

a

1

2

nan

nan

b

2

2

5

2

c

3

3

6

nan

d

>>> df.isna()
       A      B      C      D
0   True  False  False  False
1  False   True   True  False
2  False  False  False  False
3  False  False   True  False (4 rows x 4 columns)
load(prefix_path, file_format='INFER')[source]

Load dataframe from file. file_format needed for consistency with other load functions.

Parameters:
  • prefix_path (str) – The prefix path for the data.

  • file_format (string, default = "INFER")

Returns:

A dataframe loaded from the prefix_path.

Return type:

arkouda.dataframe.DataFrame

Examples

To store data in <my_dir>/my_data_LOCALE0000, use “<my_dir>/my_data” as the prefix.

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf5_output','my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.save(my_path, file_type="distribute")
>>> df.load(my_path)

A

B

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

memory_usage(index=True, unit='B') Series[source]

Return the memory usage of each column in bytes.

The memory usage can optionally include the contribution of the index.

Parameters:
  • index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If index=True, the memory usage of the index is the first item in the output.

  • unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Returns:

A Series whose index is the original column names and whose values is the memory usage of each column in bytes.

Return type:

Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> dtypes = [ak.int64, ak.float64,  ak.bool]
>>> data = dict([(str(t), ak.ones(5000, dtype=ak.int64).astype(t)) for t in dtypes])
>>> df = ak.DataFrame(data)
>>> display(df.head())

int64

float64

bool

0

1

1

True

1

1

1

True

2

1

1

True

3

1

1

True

4

1

1

True

>>> df.memory_usage()

0

Index

40000

int64

40000

float64

40000

bool

5000

>>> df.memory_usage(index=False)

0

int64

40000

float64

40000

bool

5000

>>> df.memory_usage(unit="KB")

0

Index

39.0625

int64

39.0625

float64

39.0625

bool

4.88281

To get the approximate total memory usage:

>>>  df.memory_usage(index=True).sum()
memory_usage_info(unit='GB')[source]

A formatted string representation of the size of this DataFrame.

Parameters:

unit (str, default = "GB") – Unit to return. One of {‘KB’, ‘MB’, ‘GB’}.

Returns:

A string representation of the number of bytes used by this DataFrame in [unit]s.

Return type:

str

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(1000), 'col2': ak.arange(1000)})
>>> df.memory_usage_info()
'0.00 GB'
>>> df.memory_usage_info(unit="KB")
'15 KB'
merge(right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]

Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).

Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Parameters:
  • right (DataFrame) – The Right DataFrame to be joined.

  • on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.

  • how ({"inner", "left", "right}, default = "inner") – The merge condition. Must be “inner”, “left”, or “right”.

  • left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.

  • right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.

  • convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.

  • sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.

Returns:

Joined Arkouda DataFrame.

Return type:

arkouda.dataframe.DataFrame

Note

Multiple column joins are only supported for integer columns.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> display(left_df)

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> display(right_df)

col1

col2

0

0

0

1

2

2

2

4

4

3

6

6

4

8

8

>>> left_df.merge(right_df, on = "col1")

col1

col2_x

col2_y

0

0

0

0

1

2

-2

2

2

4

-4

4

>>> left_df.merge(right_df, on = "col1", how = "left")

col1

col2_y

col2_x

0

0

0

0

1

1

nan

-1

2

2

2

-2

3

3

nan

-3

4

4

4

-4

>>> left_df.merge(right_df, on = "col1", how = "right")

col1

col2_x

col2_y

0

0

0

0

1

2

-2

2

2

4

-4

4

3

6

nan

6

4

8

nan

8

>>> left_df.merge(right_df, on = "col1", how = "outer")

col1

col2_y

col2_x

0

0

0

0

1

1

nan

-1

2

2

2

-2

3

3

nan

-3

4

4

4

-4

5

6

6

nan

6

8

8

nan

notna() DataFrame[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. numpy.NaN values get mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import numpy as np
>>> df = ak.DataFrame({"A": [np.nan, 2, 2, 3], "B": [3, np.nan, 5, 6],
...          "C": [1, np.nan, 2, np.nan], "D":["a","b","c","d"]})
>>> display(df)

A

B

C

D

0

nan

3

1

a

1

2

nan

nan

b

2

2

5

2

c

3

3

6

nan

d

>>> df.notna()
       A      B      C     D
0  False   True   True  True
1   True  False  False  True
2   True   True   True  True
3   True   True  False  True (4 rows x 4 columns)
objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

read_csv(filename: str, col_delim: str = ',')[source]

Read the columns of a CSV file into an Arkouda DataFrame. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings objects.

Parameters:
  • filename (str) – Filename to read data from.

  • col_delim (str, default=",") – The delimiter for columns within the data.

Returns:

Arkouda DataFrame containing the columns from the CSV file.

Return type:

arkouda.dataframe.DataFrame

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server.

See also

to_csv

Notes

  • CSV format is not currently supported by load/load_all operations.

  • The column delimiter is expected to be the same for column names and data.

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (”\n”) at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing

bytes as uint(8).

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output','my_data')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_csv(my_path)
>>> df2 = DataFrame.read_csv(my_path + "_LOCALE0000")
>>> display(df2)

A

B

0

1

3

1

2

4

register(user_defined_name: str) DataFrame[source]

Register this DataFrame object and underlying components with the Arkouda server.

Parameters:

user_defined_name (str) – User defined name the DataFrame is to be registered under. This will be the root name for underlying components.

Returns:

The same DataFrame which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different DataFrames with the same name.

Return type:

arkouda.dataframe.DataFrame

Raises:
  • TypeError – Raised if user_defined_name is not a str.

  • RegistrationError – If the server was unable to register the DataFrame with the user_defined_name.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Any changes made to a DataFrame object after registering with the server may not be reflected in attached copies.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
rename(mapper: Callable | Dict | None = None, index: Callable | Dict | None = None, column: Callable | Dict | None = None, axis: str | int = 0, inplace: bool = False) DataFrame | None[source]

Rename indexes or columns according to a mapping.

Parameters:
  • mapper (callable or dict-like, Optional) – Function or dictionary mapping existing values to new values. Nonexistent names will not raise an error. Uses the value of axis to determine if renaming column or index

  • column (callable or dict-like, Optional) – Function or dictionary mapping existing column names to new column names. Nonexistent names will not raise an error. When this is set, axis is ignored.

  • index (callable or dict-like, Optional) – Function or dictionary mapping existing index names to new index names. Nonexistent names will not raise an error. When this is set, axis is ignored.

  • axis (int or str, default=0) – Indicates which axis to perform the rename. 0/”index” - Indexes 1/”column” - Columns

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DateFrame when inplace=False; None when inplace=True.

Return type:

arkouda.dataframe.DataFrame or None

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)

A

B

0

1

4

1

2

5

2

3

6

Rename columns using a mapping:

>>> df.rename(column={'A':'a', 'B':'c'})

a

c

0

1

4

1

2

5

2

3

6

Rename indexes using a mapping:

>>> df.rename(index={0:99, 2:11})

A

B

0

1

4

1

2

5

2

3

6

Rename using an axis style parameter:

>>> df.rename(str.lower, axis='column')

a

b

0

1

4

1

2

5

2

3

6

reset_index(size: int | None = None, inplace: bool = False) None | DataFrame[source]

Set the index to an integer range.

Useful if this dataframe is the result of a slice operation from another dataframe, or if you have permuted the rows and no longer need to keep that ordering on the rows.

Parameters:
  • size (int, optional) – If size is passed, do not attempt to determine size based on existing column sizes. Assume caller handles consistency correctly.

  • inplace (bool, default=False) – When True, perform the operation on the calling object. When False, return a new object.

Returns:

DateFrame when inplace=False; None when inplace=True.

Return type:

arkouda.dataframe.DataFrame or None

Note

Pandas adds a column ‘index’ to indicate the original index. Arkouda does not currently support this behavior.

Example

>>> df = ak.DataFrame({"A": ak.array([1, 2, 3]), "B": ak.array([4, 5, 6])})
>>> display(df)

A

B

0

1

4

1

2

5

2

3

6

>>> perm_df = df[ak.array([0,2,1])]
>>> display(perm_df)

A

B

0

1

4

1

3

6

2

2

5

>>> perm_df.reset_index()

A

B

0

1

4

1

3

6

2

2

5

sample(n=5)[source]

Return a random sample of n rows.

Parameters:

n (int, default=5) – Number of rows to return.

Returns:

The sampled n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

Example

>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> display(df)

A

B

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

Random output of size 3:

>>> df.sample(n=3)

A

B

0

0

0

1

1

-1

2

4

-4

save(path, index=False, columns=None, file_format='HDF5', file_type='distribute', compression: str | None = None)[source]

DEPRECATED Save DataFrame to disk, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (list, default=None) – List of columns to include in the file. If None, writes out all columns.

  • file_format (str, default='HDF5') – ‘HDF5’ or ‘Parquet’. Defaults to ‘HDF5’

  • file_type (str, default=distribute) – “single” or “distribute” If single, will right a single file to locale 0.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Compression type. Only used for Parquet

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_parquet, to_hdf

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf5_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A": ak.arange(5), "B": -1 * ak.arange(5)})
>>> df.save(my_path + '/my_data', file_type="single")
>>> df.load(my_path + '/my_data')

A

B

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

property shape

The shape of the dataframe.

Returns:

Tuple of array dimensions.

Return type:

tuple of int

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df

col1

col2

0

1

4

1

2

5

2

3

6

>>> df.shape
(3, 2)
property size

Returns the number of bytes on the arkouda server.

Returns:

The number of bytes on the arkouda server.

Return type:

int

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df

col1

col2

0

1

4

1

2

5

2

3

6

>>> df.size
6
sort_index(ascending=True)[source]

Sort the DataFrame by indexed columns.

Note: Fails on sort order of arkouda.strings.Strings columns when multiple columns being sorted.

Parameters:

ascending (bool, default = True) – Sort values in ascending (default) or descending order.

Example

>>> df = ak.DataFrame({'col1': [1.1, 3.1, 2.1], 'col2': [6, 5, 4]},
...          index = Index(ak.array([2,0,1]), name="idx"))
>>> display(df)

idx

col1

col2

0

1.1

6

1

3.1

5

2

2.1

4

>>> df.sort_index()

idx

col1

col2

0

3.1

5

1

2.1

4

2

1.1

6

sort_values(by=None, ascending=True)[source]

Sort the DataFrame by one or more columns.

If no column is specified, all columns are used.

Note: Fails on order of arkouda.strings.Strings columns when multiple columns being sorted.

Parameters:
  • by (str or list/tuple of str, default = None) – The name(s) of the column(s) to sort by.

  • ascending (bool, default = True) – Sort values in ascending (default) or descending order.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': [2, 2, 1], 'col2': [3, 4, 3], 'col3':[5, 6, 7]})
>>> display(df)

col1

col2

col3

0

2

3

5

1

2

4

6

2

1

3

7

>>> df.sort_values()

col1

col2

col3

0

1

3

7

1

2

3

5

2

2

4

6

>>> df.sort_values("col3")

col1

col2

col3

0

1

3

7

1

2

3

5

2

2

4

6

tail(n=5)[source]

Return the last n rows.

This function returns the last n rows for the dataframe. It is useful for quickly testing if your object has the right type of data in it.

Parameters:

n (int, default=5) – Number of rows to select.

Returns:

The last n rows of the DataFrame.

Return type:

arkouda.dataframe.DataFrame

See also

arkouda.dataframe.head

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({'col1': ak.arange(10), 'col2': -1 * ak.arange(10)})
>>> display(df)

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

5

5

-5

6

6

-6

7

7

-7

8

8

-8

9

9

-9

>>> df.tail()

col1

col2

0

5

-5

1

6

-6

2

7

-7

3

8

-8

4

9

-9

>>> df.tail(n=2)

col1

col2

0

8

-8

1

9

-9

to_csv(path: str, index: bool = False, columns: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]

Writes DataFrame to CSV file(s). File will contain a column for each column in the DataFrame. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • index (bool, default=False) – If True, the index of the DataFrame will be written to the file as a column.

  • columns (list of str (Optional)) – Column names to assign when writing data.

  • col_delim (str, default=",") – Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool, default=False) – If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

None

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server.

Notes

  • CSV format is not currently supported by load/load_all operations.

  • The column delimiter is expected to be the same for column names and data.

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (”\n”) at this time.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'csv_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_csv(my_path + "/my_data")
>>> df2 = DataFrame.read_csv(my_path + "/my_data" + "_LOCALE0000")
>>> display(df2)

A

B

0

1

3

1

2

4

to_hdf(path, index=False, columns=None, file_type='distribute')[source]

Save DataFrame to disk as hdf5, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (List, default = None) – List of columns to include in the file. If None, writes out all columns.

  • file_type (str (single | distribute), default=distribute) – Whether to save to a single file or distribute across Locales.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_parquet, load

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

A

B

0

1

3

1

2

4

to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)[source]

Print DataFrame in Markdown-friendly format.

Parameters:
  • mode (str, optional) – Mode in which file is opened, “wt” by default.

  • index (bool, optional, default True) – Add index (row) labels.

  • tablefmt (str = "grid") – Table format to call from tablulate: https://pypi.org/project/tabulate/

  • storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set of allowed keys and values.

  • **kwargs – These parameters will be passed to tabulate.

Note

This function should only be called on small DataFrames as it calls pandas.DataFrame.to_markdown: https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.to_markdown.html

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
>>> print(df.to_markdown())
+----+------------+------------+
|    | animal_1   | animal_2   |
+====+============+============+
|  0 | elk        | dog        |
+----+------------+------------+
|  1 | pig        | quetzal    |
+----+------------+------------+

Suppress the index:

>>> print(df.to_markdown(index = False))
+------------+------------+
| animal_1   | animal_2   |
+============+============+
| elk        | dog        |
+------------+------------+
| pig        | quetzal    |
+------------+------------+
to_pandas(datalimit=1073741824, retain_index=False)[source]

Send this DataFrame to a pandas DataFrame.

Parameters:
  • datalimit (int, default=arkouda.client.maxTransferBytes) – The maximum number size, in megabytes to transfer. The requested DataFrame will be converted to a pandas DataFrame only if the estimated size of the DataFrame does not exceed this value.

  • retain_index (bool, default=False) – Normally, to_pandas() creates a new range index object. If you want to keep the index column, set this to True.

Returns:

The result of converting this DataFrame to a pandas DataFrame.

Return type:

pandas.DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> ak_df = ak.DataFrame({"A": ak.arange(2), "B": -1 * ak.arange(2)})
>>> type(ak_df)
arkouda.dataframe.DataFrame
>>> display(ak_df)

A

B

0

0

0

1

1

-1

>>> import pandas as pd
>>> pd_df = ak_df.to_pandas()
>>> type(pd_df)
pandas.core.frame.DataFrame
>>> display(pd_df)

A

B

0

0

0

1

1

-1

to_parquet(path, index=False, columns=None, compression: str | None = None, convert_categoricals: bool = False)[source]

Save DataFrame to disk as parquet, preserving column names.

Parameters:
  • path (str) – File path to save data.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (list) – List of columns to include in the file. If None, writes out all columns.

  • compression (str (Optional), default=None) – Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4

  • convert_categoricals (bool, default=False) – Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. If set, write the equivalent Strings in place of any Categorical columns.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

This method saves one file per locale of the arkouda server. All files are prefixed by the path argument and suffixed by their locale number.

See also

to_hdf, load

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'parquet_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_parquet(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

B

A

0

3

1

1

4

2

transfer(hostname, port)[source]

Sends a DataFrame to a different Arkouda server.

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the DataFrame is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Returns:

A message indicating a complete transfer.

Return type:

str

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister()[source]

Unregister this DataFrame object in the arkouda server which was previously registered using register() and/or attached to using attach().

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister.

Notes

Objects registered with the server are immune to deletion until they are unregistered.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister()
>>> df.is_registered()
False
unregister_dataframe_by_name(user_defined_name: str) str[source]

Function to unregister DataFrame object by name which was registered with the arkouda server via register().

Parameters:

user_defined_name (str) – Name under which the DataFrame object was registered.

Raises:
  • TypeError – If user_defined_name is not a string.

  • RegistrationError – If there is an issue attempting to unregister any underlying components.

Example

>>> df = ak.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df.register("my_table_name")
>>> df.attach("my_table_name")
>>> df.is_registered()
True
>>> df.unregister_dataframe_by_name("my_table_name")
>>> df.is_registered()
False
update_hdf(prefix_path: str, index=False, columns=None, repack: bool = True)[source]

Overwrite the dataset with the name provided with this dataframe. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share.

  • index (bool, default=False) – If True, save the index column. By default, do not save the index.

  • columns (List, default=None) – List of columns to include in the file. If None, writes out all columns.

  • repack (bool, default=True) – HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Returns:

Success message if successful.

Return type:

str

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

If file does not contain File_Format attribute to indicate how it was saved,

the file name is checked for _LOCALE#### to determine if it is distributed.

If the dataset provided does not exist, it will be added.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> import os.path
>>> from pathlib import Path
>>> my_path = os.path.join(os.getcwd(), 'hdf_output')
>>> Path(my_path).mkdir(parents=True, exist_ok=True)
>>> df = ak.DataFrame({"A":[1,2],"B":[3,4]})
>>> df.to_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

A

B

0

1

3

1

2

4

>>> df2 = ak.DataFrame({"A":[5,6],"B":[7,8]})
>>> df2.update_hdf(my_path + "/my_data")
>>> df.load(my_path + "/my_data")

A

B

0

5

7

1

6

8

update_nrows()[source]

Computes the number of rows on the arkouda server and updates the size parameter.

class arkouda.DataFrameGroupBy[source]

A DataFrame that has been grouped by a subset of columns.

Parameters:
  • gb_key_names (str or list(str), default=None) – The column name(s) associated with the aggregated columns.

  • as_index (bool, default=True) – If True, interpret aggregated column as index (only implemented for single dimensional aggregates). Otherwise, treat aggregated column as a dataframe column.

gb

GroupBy object, where the aggregation keys are values of column(s) of a dataframe, usually in preparation for aggregating with respect to the other columns.

Type:

arkouda.groupbyclass.GroupBy

df

The dataframe containing the original data.

Type:

arkouda.dataframe.DataFrame

gb_key_names

The column name(s) associated with the aggregated columns.

Type:

str or list(str)

as_index

If True the grouped values of the aggregation keys will be treated as an index.

Type:

bool, default=True

all(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

any(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

argmax(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

argmin(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

broadcast(x, permute=True)[source]

Fill each group’s segment with a constant value.

Parameters:
  • x (Series or pdarray) – The values to put in each group’s segment.

  • permute (bool, default=True) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

A Series with the Index of the original frame and the values of the broadcast.

Return type:

arkouda.series.Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.dataframe import DataFrameGroupBy
>>> df = ak.DataFrame({"A":[1,2,2,3],"B":[3,4,5,6]})

A

B

0

1

3

1

2

4

2

2

5

3

3

6

>>> gb = df.groupby("A")
>>> x = ak.array([10,11,12])
>>> s = DataFrameGroupBy.broadcast(gb, x)
>>> df["C"] = s.values
>>> display(df)

A

B

C

0

1

3

10

1

2

4

11

2

2

5

11

3

3

6

12

count(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

diff(colname)[source]

Create a difference aggregate for the given column.

For each group, the difference between successive values is calculated. Aggregate operations (mean,min,max,std,var) can be done on the results.

Parameters:

colname (str) – Name of the column to compute the difference on.

Returns:

Object containing the differences, which can be aggregated.

Return type:

DiffAggregate

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[1,2,2,2,3,3],"B":[3,9,11,27,86,100]})
>>> display(df)

A

B

0

1

3

1

2

9

2

2

11

3

2

27

4

3

86

5

3

100

>>> gb = df.groupby("A")
>>> gb.diff("B").values
array([nan nan 2.00000000000000000 16.00000000000000000 nan 14.00000000000000000])
first(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

head(n: int = 5, sort_index: bool = True) DataFrame[source]

Return the first n rows from each group.

Parameters:
  • n (int, optional, default = 5) – Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the values from that group will be returned.

  • sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> from arkouda import *
>>> df = ak.DataFrame({"a":ak.arange(10) %3 , "b":ak.arange(10)})

a

b

0

0

0

1

1

1

2

2

2

3

0

3

4

1

4

5

2

5

6

0

6

7

1

7

8

2

8

9

0

9

>>> df.groupby("a").head(2)

a

b

0

0

0

1

0

3

2

1

1

3

1

4

4

2

2

5

2

5

max(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

mean(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

median(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

min(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

mode(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

nunique(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

prod(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

sample(n=None, frac=None, replace=False, weights=None, random_state=None)[source]

Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility

Parameters:
  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the same row more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the underlying DataFrame and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

Returns:

A new DataFrame containing items randomly sampled from each group sorted according to the grouped columns.

Return type:

DataFrame

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[3,1,2,1,2,3],"B":[3,4,5,6,7,8]})
>>> display(df)
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  0 |   3 |   3 |
+----+-----+-----+
|  1 |   1 |   4 |
+----+-----+-----+
|  2 |   2 |   5 |
+----+-----+-----+
|  3 |   1 |   6 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  5 |   3 |   8 |
+----+-----+-----+
>>> df.groupby("A").sample(random_state=6)

A

B

3

1

6

4

2

7

5

3

8

>>> df.groupby("A").sample(frac=0.5, random_state=3, weights=ak.array([1,1,1,0,0,0]))

A

B

1

1

4

2

2

5

0

3

3

>>> df.groupby("A").sample(n=3, replace=True, random_state=ak.random.default_rng(7))
+----+-----+-----+
|    |   A |   B |
+====+=====+=====+
|  1 |   1 |   4 |
+----+-----+-----+
|  3 |   1 |   6 |
+----+-----+-----+
|  1 |   1 |   4 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  4 |   2 |   7 |
+----+-----+-----+
|  0 |   3 |   3 |
+----+-----+-----+
|  5 |   3 |   8 |
+----+-----+-----+
|  5 |   3 |   8 |
+----+-----+-----+
size(as_series=None, sort_index=True)[source]

Compute the size of each value as the total number of rows, including NaN values.

Parameters:
  • as_series (bool, default=None) – Indicates whether to return arkouda.dataframe.DataFrame (if as_series = False) or arkouda.series.Series (if as_series = True)

  • sort_index (bool, default=True) – If True, results will be returned with index values sorted in ascending order.

Return type:

arkouda.dataframe.DataFrame or arkouda.series.Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> df = ak.DataFrame({"A":[1,2,2,3],"B":[3,4,5,6]})
>>> display(df)

A

B

0

1

3

1

2

4

2

2

5

3

3

6

>>> df.groupby("A").size(as_series = False)

size

0

1

1

2

2

1

std(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

sum(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

tail(n: int = 5, sort_index: bool = True) DataFrame[source]

Return the last n rows from each group.

Parameters:
  • n (int, optional, default = 5) – Maximum number of rows to return for each group. If the number of rows in a group is less than n, all the rows from that group will be returned.

  • sort_index (bool, default = True) – If true, return the DataFrame with indices sorted.

Return type:

arkouda.dataframe.DataFrame

Examples

>>> import arkouda as ak
>>> from arkouda import *
>>> df = ak.DataFrame({"a":ak.arange(10) %3 , "b":ak.arange(10)})

a

b

0

0

0

1

1

1

2

2

2

3

0

3

4

1

4

5

2

5

6

0

6

7

1

7

8

2

8

9

0

9

>>> df.groupby("a").tail(2)

a

b

0

0

6

1

0

9

2

1

4

3

1

7

4

2

5

5

2

8

unique(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

var(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

xor(colnames=None)

Aggregate the operation, with the grouped column(s) values as keys.

Parameters:

colnames ((list of) str, default=None) – Column name or list of column names to compute the aggregation over.

Return type:

arkouda.dataframe.DataFrame

class arkouda.DataSource

DataSource(destpath=’.’)

A generic data source file (file, http, ftp, …).

DataSources can be local files or remote files/URLs. The files may also be compressed or uncompressed. DataSource hides some of the low-level details of downloading the file, allowing you to simply pass in a valid file path (or URL) and obtain a file object.

Parameters:

destpath (str or None, optional) – Path to the directory where the source file gets downloaded to for use. If destpath is None, a temporary directory will be created. The default path is the current directory.

Notes

URLs require a scheme string (http://) to be used, without it they will fail:

>>> repos = np.DataSource()
>>> repos.exists('www.google.com/index.html')
False
>>> repos.exists('http://www.google.com/index.html')
True

Temporary directories are deleted when the DataSource is deleted.

Examples

>>> ds = np.DataSource('/home/guido')
>>> urlname = 'http://www.google.com/'
>>> gfile = ds.open('http://www.google.com/')
>>> ds.abspath(urlname)
'/home/guido/www.google.com/index.html'

>>> ds = np.DataSource(None)  # use with temporary file
>>> ds.open('/home/guido/foobar.txt')
<open file '/home/guido.foobar.txt', mode 'r' at 0x91d4430>
>>> ds.abspath('/home/guido/foobar.txt')
'/tmp/.../home/guido/foobar.txt'
abspath(path)[source]

Return absolute path of file in the DataSource directory.

If path is an URL, then abspath will return either the location the file exists locally or the location it would exist when opened using the open method.

Parameters:

path (str) – Can be a local file or a remote URL.

Returns:

out – Complete path, including the DataSource destination directory.

Return type:

str

Notes

The functionality is based on os.path.abspath.

exists(path)[source]

Test if path exists.

Test if path exists as (and in this order):

  • a local file.

  • a remote URL that has been downloaded and stored locally in the DataSource directory.

  • a remote URL that has not been downloaded, but is valid and accessible.

Parameters:

path (str) – Can be a local file or a remote URL.

Returns:

out – True if path exists.

Return type:

bool

Notes

When path is an URL, exists will return True if it’s either stored locally in the DataSource directory, or is a valid remote URL. DataSource does not discriminate between the two, the file is accessible if it exists in either location.

open(path, mode='r', encoding=None, newline=None)[source]

Open and return file-like object.

If path is an URL, it will be downloaded, stored in the DataSource directory and opened from there.

Parameters:
  • path (str) – Local file path or URL to open.

  • mode ({'r', 'w', 'a'}, optional) – Mode to open path. Mode ‘r’ for reading, ‘w’ for writing, ‘a’ to append. Available modes depend on the type of object specified by path. Default is ‘r’.

  • encoding ({None, str}, optional) – Open text file with given encoding. The default encoding will be what io.open uses.

  • newline ({None, str}, optional) – Newline to use when reading text file.

Returns:

out – File object.

Return type:

file object

class arkouda.DateTime64DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.Datetime(pda, unit: str = _BASE_UNIT)[source]

Bases: _AbstractBaseTime

Represents a date and/or time.

Datetime is the Arkouda analog to pandas DatetimeIndex and other timeseries data types.

Parameters:
  • pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array)

  • unit (str, default 'ns') –

    For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.

    Possible values:

    • ’weeks’ or ‘w’

    • ’days’ or ‘d’

    • ’hours’ or ‘h’

    • ’minutes’, ‘m’, or ‘t’

    • ’seconds’ or ‘s’

    • ’milliseconds’, ‘ms’, or ‘l’

    • ’microseconds’, ‘us’, or ‘u’

    • ’nanoseconds’, ‘ns’, or ‘n’

    Unlike in pandas, units cannot be combined or mixed with integers

Notes

The .values attribute is always in nanoseconds with int64 dtype.

property date
property day
property day_of_week
property day_of_year
property dayofweek
property dayofyear
property hour
property is_leap_year
is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isocalendar()[source]
property microsecond
property millisecond
property minute
property month
property nanosecond
register(user_defined_name)[source]

Register this Datetime object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Datetime is to be registered under, this will be the root name for underlying components

Returns:

The same Datetime which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Datetimes with the same name.

Return type:

Datetime

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Datetimes with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property second
special_objType = 'Datetime'
sum()[source]

Return the sum of all elements in the array.

supported_opeq
supported_with_datetime
supported_with_pdarray
supported_with_r_datetime
supported_with_r_pdarray
supported_with_r_timedelta
supported_with_timedelta
to_pandas()[source]

Convert array to a pandas DatetimeIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.

See also

to_ndarray

unregister()[source]

Unregister this Datetime object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property week
property weekday
property weekofyear
property year
class arkouda.Datetime(pda, unit: str = _BASE_UNIT)[source]

Bases: _AbstractBaseTime

Represents a date and/or time.

Datetime is the Arkouda analog to pandas DatetimeIndex and other timeseries data types.

Parameters:
  • pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array)

  • unit (str, default 'ns') –

    For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.

    Possible values:

    • ’weeks’ or ‘w’

    • ’days’ or ‘d’

    • ’hours’ or ‘h’

    • ’minutes’, ‘m’, or ‘t’

    • ’seconds’ or ‘s’

    • ’milliseconds’, ‘ms’, or ‘l’

    • ’microseconds’, ‘us’, or ‘u’

    • ’nanoseconds’, ‘ns’, or ‘n’

    Unlike in pandas, units cannot be combined or mixed with integers

Notes

The .values attribute is always in nanoseconds with int64 dtype.

property date
property day
property day_of_week
property day_of_year
property dayofweek
property dayofyear
property hour
property is_leap_year
is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isocalendar()[source]
property microsecond
property millisecond
property minute
property month
property nanosecond
register(user_defined_name)[source]

Register this Datetime object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Datetime is to be registered under, this will be the root name for underlying components

Returns:

The same Datetime which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Datetimes with the same name.

Return type:

Datetime

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Datetimes with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property second
special_objType = 'Datetime'
sum()[source]

Return the sum of all elements in the array.

supported_opeq
supported_with_datetime
supported_with_pdarray
supported_with_r_datetime
supported_with_r_pdarray
supported_with_r_timedelta
supported_with_timedelta
to_pandas()[source]

Convert array to a pandas DatetimeIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.

See also

to_ndarray

unregister()[source]

Unregister this Datetime object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property week
property weekday
property weekofyear
property year
class arkouda.Datetime(pda, unit: str = _BASE_UNIT)[source]

Bases: _AbstractBaseTime

Represents a date and/or time.

Datetime is the Arkouda analog to pandas DatetimeIndex and other timeseries data types.

Parameters:
  • pda (int64 pdarray, pd.DatetimeIndex, pd.Series, or np.datetime64 array)

  • unit (str, default 'ns') –

    For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.

    Possible values:

    • ’weeks’ or ‘w’

    • ’days’ or ‘d’

    • ’hours’ or ‘h’

    • ’minutes’, ‘m’, or ‘t’

    • ’seconds’ or ‘s’

    • ’milliseconds’, ‘ms’, or ‘l’

    • ’microseconds’, ‘us’, or ‘u’

    • ’nanoseconds’, ‘ns’, or ‘n’

    Unlike in pandas, units cannot be combined or mixed with integers

Notes

The .values attribute is always in nanoseconds with int64 dtype.

property date
property day
property day_of_week
property day_of_year
property dayofweek
property dayofyear
property hour
property is_leap_year
is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isocalendar()[source]
property microsecond
property millisecond
property minute
property month
property nanosecond
register(user_defined_name)[source]

Register this Datetime object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Datetime is to be registered under, this will be the root name for underlying components

Returns:

The same Datetime which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Datetimes with the same name.

Return type:

Datetime

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Datetimes with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property second
special_objType = 'Datetime'
sum()[source]

Return the sum of all elements in the array.

supported_opeq
supported_with_datetime
supported_with_pdarray
supported_with_r_datetime
supported_with_r_pdarray
supported_with_r_timedelta
supported_with_timedelta
to_pandas()[source]

Convert array to a pandas DatetimeIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.

See also

to_ndarray

unregister()[source]

Unregister this Datetime object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property week
property weekday
property weekofyear
property year
class arkouda.DatetimeAccessor(series)[source]

Bases: Properties

series
class arkouda.DiffAggregate[source]

A column in a GroupBy that has been differenced. Aggregation operations can be done on the result.

gb

GroupBy object, where the aggregation keys are values of column(s) of a dataframe.

Type:

arkouda.groupbyclass.GroupBy

values

A column to compute the difference on.

Type:

arkouda.series.Series.

all()
any()
argmax()
argmin()
count()
first()
max()
mean()
median()
min()
mode()
nunique()
prod()
std()
sum()
unique()
var()
xor()
class arkouda.ErrorMode[source]

An enumeration.

ignore(*args, **kwargs)

An enumeration.

name(*args, **kwargs)

The name of the Enum member.

return_validity(*args, **kwargs)

An enumeration.

strict(*args, **kwargs)

An enumeration.

value(*args, **kwargs)

The value of the Enum member.

class arkouda.False_(value)

Bases: numpy.generic

Boolean type (True or False), stored as a byte.

Warning

The bool_ type is not a subclass of the int_ type (the bool_ is not even a number type). This is different than Python’s default implementation of bool as a sub-class of int.

Character code:

'?'

class arkouda.Fields(values, names, MSB_left=True, pad='-', separator='', show_int=True)[source]

Bases: BitVector

An integer-backed representation of a set of named binary fields, e.g. flags.

Parameters:
  • values (pdarray or Strings) – The array of field values. If (u)int64, the values are used as-is for the binary representation of fields. If Strings, the values are converted to binary according to the mapping defined by the names and MSB_left arguments.

  • names (str or sequence of str) – The names of the fields, in order. A string will be treated as a list of single-character field names. Multi-character field names are allowed, but must be passed as a list or tuple and user must specify a separator.

  • MSB_left (bool) – Controls how field names are mapped to binary values. If True (default), the left-most field name corresponds to the most significant bit in the binary representation. If False, the left-most field name corresponds to the least significant bit.

  • pad (str) – Character to display when field is not present. Use empty string if no padding is desired.

  • separator (str) – Substring that separates fields. Used to parse input values (if ak.Strings) and to display output.

  • show_int (bool) – If True (default), display the integer value of the binary fields in output.

Returns:

fields – The array of field values

Return type:

Fields

Notes

This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like an int64 pdarray.

MSB_left = True
format(x)[source]

Format a single binary value as a string of named fields.

name = None
names
namewidth
opeq(other, op)[source]
pad
padchar = '-'
separator = ''
show_int = True
class arkouda.Float16DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.Float32DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.Float64DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.GROUPBY_REDUCTION_TYPES

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.GroupBy[source]

Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.

Parameters:
  • keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row

  • assume_sorted (bool) – If True, assume keys is already sorted (Default: False)

nkeys

The number of key arrays (columns)

Type:

int

size[source]

The length of the input array(s), i.e. number of rows

Type:

int

permutation

The permutation that sorts the keys array(s) by value (row)

Type:

pdarray

unique_keys

The unique values of the keys array(s), in grouped order

Type:

(list of) pdarray, Strings, or Categorical

ngroups

The length of the unique_keys array(s), i.e. number of groups

Type:

int

segments

The start index of each group in the grouped array(s)

Type:

pdarray

logger

Used for all logging operations

Type:

ArkoudaLogger

dropna

If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Type:

bool (default=True)

Raises:

TypeError – Raised if keys is a pdarray with a dtype other than int64

Notes

Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.

For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:

  1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.

  2. (Optional) a .group() method that returns the permutation that groups the array

If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.

AND(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise AND of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with AND

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

OR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise OR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with OR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

Reductions(*args, **kwargs)

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

XOR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise XOR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with XOR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

aggregate(values: groupable, operator: str, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, groupable][source]

Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.

Parameters:
  • values (pdarray) – The values to group and reduce

  • operator (str) – The name of the reduction operator to use

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • aggregates (groupable) – One aggregate value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if the requested operator is not supported for the values dtype

Examples

>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768,
-0.55555555555555536, -0.33333333333333348, -0.11111111111111116,
0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768,
1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779,
-0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116,
0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
all(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “and”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

any(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “or”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

argmax(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmax

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
argmin(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmin

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if argmin is not supported for the values dtype

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
attach(user_defined_name: str) GroupBy[source]

Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which GroupBy object was registered under

Returns:

The GroupBy object created by re-attaching to the corresponding server components

Return type:

GroupBy

Raises:

RegistrationError – if user_defined_name is not registered

broadcast(values: pdarray | Strings, permute: bool = True) pdarray | Strings[source]

Fill each group’s segment with a constant value.

Parameters:
  • values (pdarray, Strings) – The values to put in each group’s segment

  • permute (bool) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

The broadcasted values

Return type:

pdarray, Strings

Raises:
  • TypeError – Raised if value is not a pdarray object

  • ValueError – Raised if the values array does not have one value per segment

Notes

This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.

Examples

>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5]
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
build_from_components(user_defined_name: str | None = None, **kwargs) GroupBy[source]

function to build a new GroupBy object from component keys and permutation.

Parameters:
  • user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name

  • kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”

Returns:

The GroupBy object created by using the given components

Return type:

GroupBy

count(values: pdarray) Tuple[groupable, pdarray][source]

Count the number of elements in each group. NaN values will be excluded from the total.

Parameters:

values (pdarray) – The values to be count by group (excluding NaN values).

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).

Examples

>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
first(values: groupable_element_type) Tuple[groupable, groupable_element_type][source]

First value in each group.

Parameters:

values (pdarray-like) – The values from which to take the first of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first value of each group

from_return_msg(rep_msg)[source]
head(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the first n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
is_registered() bool[source]

Return True if the object is contained in the registry

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

max(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find maxima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_maxima (pdarray) – One maximum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if max is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
mean(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.

Parameters:
  • values (pdarray) – The values to group and average

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
median(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find median

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
min(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find minima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_minima (pdarray) – One minimum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if min is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
mode(values: groupable) Tuple[groupable, groupable][source]

Most common value in each group. If a group is multi-modal, return the modal value that occurs first.

Parameters:

values ((list of) pdarray-like) – The values from which to take the mode of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) pdarray-like) – The most common value of each group

most_common(values)[source]

(Deprecated) See GroupBy.mode().

nunique(values: groupable) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.

Parameters:

values (pdarray, int64) – The values to group and find unique values

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if nunique is not supported for the values dtype

Examples

>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
array([1,2,3,4]), array([2, 2, 3, 1])
#    Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
#    Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
#    Group (3,3,3) has values [3,4,1] -> 3 unique values
#    Group (4) has values [4] -> 1 unique value
objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

prod(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.

Parameters:
  • values (pdarray) – The values to group and multiply

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_products (pdarray, float64) – One product per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if prod is not supported for the values dtype

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
register(user_defined_name: str) GroupBy[source]

Register this GroupBy object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components

Returns:

The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.

Return type:

GroupBy

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the GroupBy with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

sample(values: groupable, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]

Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility

Parameters:
  • values ((list of) pdarray-like) – The values from which to sample, according to their group membership.

  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the value more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the groupby keys and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

  • return_indices (bool, default False) – if True, return the indices of the sampled values. Otherwise, return the sample values.

  • permute_samples (bool, default False) – if True, return permute the samples according to group Otherwise, keep samples in original order.

Returns:

if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.

Return type:

pdarray

size() Tuple[groupable, pdarray][source]

Count the number of elements in each group, i.e. the number of times each key appears. This counts the total number of rows (including NaN values).

Parameters:

none

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears

See also

count

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
std(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find standard deviation

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
sum(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.

Parameters:
  • values (pdarray) – The values to group and sum

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_sums (pdarray) – One sum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The grouped sum of a boolean pdarray returns integers.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
tail(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the last n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The last n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]

Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Returns:

  • None

  • GroupBy is not currently supported by Parquet

unique(values: groupable)[source]

Return the set of unique values in each group, as a SegArray.

Parameters:

values ((list of) pdarray-like) – The values to unique

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) SegArray) – The unique values of each group

Raises:

TypeError – Raised if values is or contains Strings or Categorical

unregister()[source]

Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

unregister_groupby_by_name(user_defined_name: str) None[source]

Function to unregister GroupBy object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the GroupBy object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)[source]
var(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find variance

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
class arkouda.GroupBy[source]

Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.

Parameters:
  • keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row

  • assume_sorted (bool) – If True, assume keys is already sorted (Default: False)

nkeys

The number of key arrays (columns)

Type:

int

size[source]

The length of the input array(s), i.e. number of rows

Type:

int

permutation

The permutation that sorts the keys array(s) by value (row)

Type:

pdarray

unique_keys

The unique values of the keys array(s), in grouped order

Type:

(list of) pdarray, Strings, or Categorical

ngroups

The length of the unique_keys array(s), i.e. number of groups

Type:

int

segments

The start index of each group in the grouped array(s)

Type:

pdarray

logger

Used for all logging operations

Type:

ArkoudaLogger

dropna

If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Type:

bool (default=True)

Raises:

TypeError – Raised if keys is a pdarray with a dtype other than int64

Notes

Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.

For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:

  1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.

  2. (Optional) a .group() method that returns the permutation that groups the array

If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.

AND(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise AND of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with AND

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

OR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise OR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with OR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

Reductions(*args, **kwargs)

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

XOR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise XOR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with XOR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

aggregate(values: groupable, operator: str, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, groupable][source]

Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.

Parameters:
  • values (pdarray) – The values to group and reduce

  • operator (str) – The name of the reduction operator to use

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • aggregates (groupable) – One aggregate value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if the requested operator is not supported for the values dtype

Examples

>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768,
-0.55555555555555536, -0.33333333333333348, -0.11111111111111116,
0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768,
1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779,
-0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116,
0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
all(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “and”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

any(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “or”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

argmax(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmax

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
argmin(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmin

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if argmin is not supported for the values dtype

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
attach(user_defined_name: str) GroupBy[source]

Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which GroupBy object was registered under

Returns:

The GroupBy object created by re-attaching to the corresponding server components

Return type:

GroupBy

Raises:

RegistrationError – if user_defined_name is not registered

broadcast(values: pdarray | Strings, permute: bool = True) pdarray | Strings[source]

Fill each group’s segment with a constant value.

Parameters:
  • values (pdarray, Strings) – The values to put in each group’s segment

  • permute (bool) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

The broadcasted values

Return type:

pdarray, Strings

Raises:
  • TypeError – Raised if value is not a pdarray object

  • ValueError – Raised if the values array does not have one value per segment

Notes

This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.

Examples

>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5]
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
build_from_components(user_defined_name: str | None = None, **kwargs) GroupBy[source]

function to build a new GroupBy object from component keys and permutation.

Parameters:
  • user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name

  • kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”

Returns:

The GroupBy object created by using the given components

Return type:

GroupBy

count(values: pdarray) Tuple[groupable, pdarray][source]

Count the number of elements in each group. NaN values will be excluded from the total.

Parameters:

values (pdarray) – The values to be count by group (excluding NaN values).

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).

Examples

>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
first(values: groupable_element_type) Tuple[groupable, groupable_element_type][source]

First value in each group.

Parameters:

values (pdarray-like) – The values from which to take the first of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first value of each group

from_return_msg(rep_msg)[source]
head(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the first n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
is_registered() bool[source]

Return True if the object is contained in the registry

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

max(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find maxima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_maxima (pdarray) – One maximum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if max is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
mean(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.

Parameters:
  • values (pdarray) – The values to group and average

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
median(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find median

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
min(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find minima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_minima (pdarray) – One minimum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if min is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
mode(values: groupable) Tuple[groupable, groupable][source]

Most common value in each group. If a group is multi-modal, return the modal value that occurs first.

Parameters:

values ((list of) pdarray-like) – The values from which to take the mode of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) pdarray-like) – The most common value of each group

most_common(values)[source]

(Deprecated) See GroupBy.mode().

nunique(values: groupable) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.

Parameters:

values (pdarray, int64) – The values to group and find unique values

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if nunique is not supported for the values dtype

Examples

>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
array([1,2,3,4]), array([2, 2, 3, 1])
#    Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
#    Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
#    Group (3,3,3) has values [3,4,1] -> 3 unique values
#    Group (4) has values [4] -> 1 unique value
objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

prod(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.

Parameters:
  • values (pdarray) – The values to group and multiply

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_products (pdarray, float64) – One product per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if prod is not supported for the values dtype

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
register(user_defined_name: str) GroupBy[source]

Register this GroupBy object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components

Returns:

The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.

Return type:

GroupBy

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the GroupBy with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

sample(values: groupable, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]

Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility

Parameters:
  • values ((list of) pdarray-like) – The values from which to sample, according to their group membership.

  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the value more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the groupby keys and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

  • return_indices (bool, default False) – if True, return the indices of the sampled values. Otherwise, return the sample values.

  • permute_samples (bool, default False) – if True, return permute the samples according to group Otherwise, keep samples in original order.

Returns:

if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.

Return type:

pdarray

size() Tuple[groupable, pdarray][source]

Count the number of elements in each group, i.e. the number of times each key appears. This counts the total number of rows (including NaN values).

Parameters:

none

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears

See also

count

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
std(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find standard deviation

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
sum(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.

Parameters:
  • values (pdarray) – The values to group and sum

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_sums (pdarray) – One sum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The grouped sum of a boolean pdarray returns integers.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
tail(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the last n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The last n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]

Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Returns:

  • None

  • GroupBy is not currently supported by Parquet

unique(values: groupable)[source]

Return the set of unique values in each group, as a SegArray.

Parameters:

values ((list of) pdarray-like) – The values to unique

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) SegArray) – The unique values of each group

Raises:

TypeError – Raised if values is or contains Strings or Categorical

unregister()[source]

Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

unregister_groupby_by_name(user_defined_name: str) None[source]

Function to unregister GroupBy object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the GroupBy object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)[source]
var(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find variance

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
class arkouda.GroupBy[source]

Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.

Parameters:
  • keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row

  • assume_sorted (bool) – If True, assume keys is already sorted (Default: False)

nkeys

The number of key arrays (columns)

Type:

int

size[source]

The length of the input array(s), i.e. number of rows

Type:

int

permutation

The permutation that sorts the keys array(s) by value (row)

Type:

pdarray

unique_keys

The unique values of the keys array(s), in grouped order

Type:

(list of) pdarray, Strings, or Categorical

ngroups

The length of the unique_keys array(s), i.e. number of groups

Type:

int

segments

The start index of each group in the grouped array(s)

Type:

pdarray

logger

Used for all logging operations

Type:

ArkoudaLogger

dropna

If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Type:

bool (default=True)

Raises:

TypeError – Raised if keys is a pdarray with a dtype other than int64

Notes

Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.

For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:

  1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.

  2. (Optional) a .group() method that returns the permutation that groups the array

If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.

AND(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise AND of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with AND

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

OR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise OR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with OR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

Reductions(*args, **kwargs)

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

XOR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise XOR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with XOR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

aggregate(values: groupable, operator: str, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, groupable][source]

Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.

Parameters:
  • values (pdarray) – The values to group and reduce

  • operator (str) – The name of the reduction operator to use

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • aggregates (groupable) – One aggregate value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if the requested operator is not supported for the values dtype

Examples

>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768,
-0.55555555555555536, -0.33333333333333348, -0.11111111111111116,
0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768,
1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779,
-0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116,
0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
all(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “and”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

any(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “or”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

argmax(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmax

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
argmin(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmin

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if argmin is not supported for the values dtype

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
attach(user_defined_name: str) GroupBy[source]

Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which GroupBy object was registered under

Returns:

The GroupBy object created by re-attaching to the corresponding server components

Return type:

GroupBy

Raises:

RegistrationError – if user_defined_name is not registered

broadcast(values: pdarray | Strings, permute: bool = True) pdarray | Strings[source]

Fill each group’s segment with a constant value.

Parameters:
  • values (pdarray, Strings) – The values to put in each group’s segment

  • permute (bool) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

The broadcasted values

Return type:

pdarray, Strings

Raises:
  • TypeError – Raised if value is not a pdarray object

  • ValueError – Raised if the values array does not have one value per segment

Notes

This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.

Examples

>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5]
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
build_from_components(user_defined_name: str | None = None, **kwargs) GroupBy[source]

function to build a new GroupBy object from component keys and permutation.

Parameters:
  • user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name

  • kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”

Returns:

The GroupBy object created by using the given components

Return type:

GroupBy

count(values: pdarray) Tuple[groupable, pdarray][source]

Count the number of elements in each group. NaN values will be excluded from the total.

Parameters:

values (pdarray) – The values to be count by group (excluding NaN values).

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).

Examples

>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
first(values: groupable_element_type) Tuple[groupable, groupable_element_type][source]

First value in each group.

Parameters:

values (pdarray-like) – The values from which to take the first of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first value of each group

from_return_msg(rep_msg)[source]
head(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the first n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
is_registered() bool[source]

Return True if the object is contained in the registry

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

max(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find maxima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_maxima (pdarray) – One maximum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if max is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
mean(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.

Parameters:
  • values (pdarray) – The values to group and average

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
median(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find median

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
min(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find minima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_minima (pdarray) – One minimum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if min is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
mode(values: groupable) Tuple[groupable, groupable][source]

Most common value in each group. If a group is multi-modal, return the modal value that occurs first.

Parameters:

values ((list of) pdarray-like) – The values from which to take the mode of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) pdarray-like) – The most common value of each group

most_common(values)[source]

(Deprecated) See GroupBy.mode().

nunique(values: groupable) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.

Parameters:

values (pdarray, int64) – The values to group and find unique values

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if nunique is not supported for the values dtype

Examples

>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
array([1,2,3,4]), array([2, 2, 3, 1])
#    Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
#    Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
#    Group (3,3,3) has values [3,4,1] -> 3 unique values
#    Group (4) has values [4] -> 1 unique value
objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

prod(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.

Parameters:
  • values (pdarray) – The values to group and multiply

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_products (pdarray, float64) – One product per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if prod is not supported for the values dtype

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
register(user_defined_name: str) GroupBy[source]

Register this GroupBy object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components

Returns:

The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.

Return type:

GroupBy

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the GroupBy with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

sample(values: groupable, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]

Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility

Parameters:
  • values ((list of) pdarray-like) – The values from which to sample, according to their group membership.

  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the value more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the groupby keys and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

  • return_indices (bool, default False) – if True, return the indices of the sampled values. Otherwise, return the sample values.

  • permute_samples (bool, default False) – if True, return permute the samples according to group Otherwise, keep samples in original order.

Returns:

if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.

Return type:

pdarray

size() Tuple[groupable, pdarray][source]

Count the number of elements in each group, i.e. the number of times each key appears. This counts the total number of rows (including NaN values).

Parameters:

none

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears

See also

count

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
std(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find standard deviation

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
sum(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.

Parameters:
  • values (pdarray) – The values to group and sum

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_sums (pdarray) – One sum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The grouped sum of a boolean pdarray returns integers.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
tail(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the last n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The last n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]

Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Returns:

  • None

  • GroupBy is not currently supported by Parquet

unique(values: groupable)[source]

Return the set of unique values in each group, as a SegArray.

Parameters:

values ((list of) pdarray-like) – The values to unique

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) SegArray) – The unique values of each group

Raises:

TypeError – Raised if values is or contains Strings or Categorical

unregister()[source]

Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

unregister_groupby_by_name(user_defined_name: str) None[source]

Function to unregister GroupBy object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the GroupBy object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)[source]
var(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find variance

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
class arkouda.GroupBy[source]

Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.

Parameters:
  • keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row

  • assume_sorted (bool) – If True, assume keys is already sorted (Default: False)

nkeys

The number of key arrays (columns)

Type:

int

size[source]

The length of the input array(s), i.e. number of rows

Type:

int

permutation

The permutation that sorts the keys array(s) by value (row)

Type:

pdarray

unique_keys

The unique values of the keys array(s), in grouped order

Type:

(list of) pdarray, Strings, or Categorical

ngroups

The length of the unique_keys array(s), i.e. number of groups

Type:

int

segments

The start index of each group in the grouped array(s)

Type:

pdarray

logger

Used for all logging operations

Type:

ArkoudaLogger

dropna

If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Type:

bool (default=True)

Raises:

TypeError – Raised if keys is a pdarray with a dtype other than int64

Notes

Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.

For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:

  1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.

  2. (Optional) a .group() method that returns the permutation that groups the array

If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.

AND(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise AND of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with AND

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

OR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise OR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with OR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

Reductions(*args, **kwargs)

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

XOR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise XOR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with XOR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

aggregate(values: groupable, operator: str, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, groupable][source]

Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.

Parameters:
  • values (pdarray) – The values to group and reduce

  • operator (str) – The name of the reduction operator to use

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • aggregates (groupable) – One aggregate value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if the requested operator is not supported for the values dtype

Examples

>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768,
-0.55555555555555536, -0.33333333333333348, -0.11111111111111116,
0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768,
1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779,
-0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116,
0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
all(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “and”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

any(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “or”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

argmax(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmax

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
argmin(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmin

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if argmin is not supported for the values dtype

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
attach(user_defined_name: str) GroupBy[source]

Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which GroupBy object was registered under

Returns:

The GroupBy object created by re-attaching to the corresponding server components

Return type:

GroupBy

Raises:

RegistrationError – if user_defined_name is not registered

broadcast(values: pdarray | Strings, permute: bool = True) pdarray | Strings[source]

Fill each group’s segment with a constant value.

Parameters:
  • values (pdarray, Strings) – The values to put in each group’s segment

  • permute (bool) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

The broadcasted values

Return type:

pdarray, Strings

Raises:
  • TypeError – Raised if value is not a pdarray object

  • ValueError – Raised if the values array does not have one value per segment

Notes

This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.

Examples

>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5]
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
build_from_components(user_defined_name: str | None = None, **kwargs) GroupBy[source]

function to build a new GroupBy object from component keys and permutation.

Parameters:
  • user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name

  • kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”

Returns:

The GroupBy object created by using the given components

Return type:

GroupBy

count(values: pdarray) Tuple[groupable, pdarray][source]

Count the number of elements in each group. NaN values will be excluded from the total.

Parameters:

values (pdarray) – The values to be count by group (excluding NaN values).

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).

Examples

>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
first(values: groupable_element_type) Tuple[groupable, groupable_element_type][source]

First value in each group.

Parameters:

values (pdarray-like) – The values from which to take the first of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first value of each group

from_return_msg(rep_msg)[source]
head(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the first n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
is_registered() bool[source]

Return True if the object is contained in the registry

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

max(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find maxima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_maxima (pdarray) – One maximum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if max is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
mean(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.

Parameters:
  • values (pdarray) – The values to group and average

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
median(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find median

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
min(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find minima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_minima (pdarray) – One minimum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if min is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
mode(values: groupable) Tuple[groupable, groupable][source]

Most common value in each group. If a group is multi-modal, return the modal value that occurs first.

Parameters:

values ((list of) pdarray-like) – The values from which to take the mode of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) pdarray-like) – The most common value of each group

most_common(values)[source]

(Deprecated) See GroupBy.mode().

nunique(values: groupable) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.

Parameters:

values (pdarray, int64) – The values to group and find unique values

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if nunique is not supported for the values dtype

Examples

>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
array([1,2,3,4]), array([2, 2, 3, 1])
#    Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
#    Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
#    Group (3,3,3) has values [3,4,1] -> 3 unique values
#    Group (4) has values [4] -> 1 unique value
objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

prod(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.

Parameters:
  • values (pdarray) – The values to group and multiply

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_products (pdarray, float64) – One product per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if prod is not supported for the values dtype

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
register(user_defined_name: str) GroupBy[source]

Register this GroupBy object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components

Returns:

The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.

Return type:

GroupBy

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the GroupBy with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

sample(values: groupable, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]

Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility

Parameters:
  • values ((list of) pdarray-like) – The values from which to sample, according to their group membership.

  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the value more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the groupby keys and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

  • return_indices (bool, default False) – if True, return the indices of the sampled values. Otherwise, return the sample values.

  • permute_samples (bool, default False) – if True, return permute the samples according to group Otherwise, keep samples in original order.

Returns:

if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.

Return type:

pdarray

size() Tuple[groupable, pdarray][source]

Count the number of elements in each group, i.e. the number of times each key appears. This counts the total number of rows (including NaN values).

Parameters:

none

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears

See also

count

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
std(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find standard deviation

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
sum(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.

Parameters:
  • values (pdarray) – The values to group and sum

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_sums (pdarray) – One sum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The grouped sum of a boolean pdarray returns integers.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
tail(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the last n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The last n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]

Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Returns:

  • None

  • GroupBy is not currently supported by Parquet

unique(values: groupable)[source]

Return the set of unique values in each group, as a SegArray.

Parameters:

values ((list of) pdarray-like) – The values to unique

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) SegArray) – The unique values of each group

Raises:

TypeError – Raised if values is or contains Strings or Categorical

unregister()[source]

Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

unregister_groupby_by_name(user_defined_name: str) None[source]

Function to unregister GroupBy object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the GroupBy object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)[source]
var(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find variance

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
class arkouda.GroupBy[source]

Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.

Parameters:
  • keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row

  • assume_sorted (bool) – If True, assume keys is already sorted (Default: False)

nkeys

The number of key arrays (columns)

Type:

int

size[source]

The length of the input array(s), i.e. number of rows

Type:

int

permutation

The permutation that sorts the keys array(s) by value (row)

Type:

pdarray

unique_keys

The unique values of the keys array(s), in grouped order

Type:

(list of) pdarray, Strings, or Categorical

ngroups

The length of the unique_keys array(s), i.e. number of groups

Type:

int

segments

The start index of each group in the grouped array(s)

Type:

pdarray

logger

Used for all logging operations

Type:

ArkoudaLogger

dropna

If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Type:

bool (default=True)

Raises:

TypeError – Raised if keys is a pdarray with a dtype other than int64

Notes

Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.

For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:

  1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.

  2. (Optional) a .group() method that returns the permutation that groups the array

If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.

AND(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise AND of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with AND

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

OR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise OR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with OR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

Reductions(*args, **kwargs)

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

XOR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise XOR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with XOR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

aggregate(values: groupable, operator: str, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, groupable][source]

Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.

Parameters:
  • values (pdarray) – The values to group and reduce

  • operator (str) – The name of the reduction operator to use

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • aggregates (groupable) – One aggregate value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if the requested operator is not supported for the values dtype

Examples

>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768,
-0.55555555555555536, -0.33333333333333348, -0.11111111111111116,
0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768,
1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779,
-0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116,
0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
all(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “and”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

any(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “or”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

argmax(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmax

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
argmin(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmin

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if argmin is not supported for the values dtype

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
attach(user_defined_name: str) GroupBy[source]

Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which GroupBy object was registered under

Returns:

The GroupBy object created by re-attaching to the corresponding server components

Return type:

GroupBy

Raises:

RegistrationError – if user_defined_name is not registered

broadcast(values: pdarray | Strings, permute: bool = True) pdarray | Strings[source]

Fill each group’s segment with a constant value.

Parameters:
  • values (pdarray, Strings) – The values to put in each group’s segment

  • permute (bool) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

The broadcasted values

Return type:

pdarray, Strings

Raises:
  • TypeError – Raised if value is not a pdarray object

  • ValueError – Raised if the values array does not have one value per segment

Notes

This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.

Examples

>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5]
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
build_from_components(user_defined_name: str | None = None, **kwargs) GroupBy[source]

function to build a new GroupBy object from component keys and permutation.

Parameters:
  • user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name

  • kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”

Returns:

The GroupBy object created by using the given components

Return type:

GroupBy

count(values: pdarray) Tuple[groupable, pdarray][source]

Count the number of elements in each group. NaN values will be excluded from the total.

Parameters:

values (pdarray) – The values to be count by group (excluding NaN values).

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).

Examples

>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
first(values: groupable_element_type) Tuple[groupable, groupable_element_type][source]

First value in each group.

Parameters:

values (pdarray-like) – The values from which to take the first of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first value of each group

from_return_msg(rep_msg)[source]
head(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the first n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
is_registered() bool[source]

Return True if the object is contained in the registry

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

max(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find maxima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_maxima (pdarray) – One maximum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if max is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
mean(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.

Parameters:
  • values (pdarray) – The values to group and average

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
median(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find median

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
min(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find minima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_minima (pdarray) – One minimum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if min is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
mode(values: groupable) Tuple[groupable, groupable][source]

Most common value in each group. If a group is multi-modal, return the modal value that occurs first.

Parameters:

values ((list of) pdarray-like) – The values from which to take the mode of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) pdarray-like) – The most common value of each group

most_common(values)[source]

(Deprecated) See GroupBy.mode().

nunique(values: groupable) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.

Parameters:

values (pdarray, int64) – The values to group and find unique values

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if nunique is not supported for the values dtype

Examples

>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
array([1,2,3,4]), array([2, 2, 3, 1])
#    Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
#    Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
#    Group (3,3,3) has values [3,4,1] -> 3 unique values
#    Group (4) has values [4] -> 1 unique value
objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

prod(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.

Parameters:
  • values (pdarray) – The values to group and multiply

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_products (pdarray, float64) – One product per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if prod is not supported for the values dtype

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
register(user_defined_name: str) GroupBy[source]

Register this GroupBy object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components

Returns:

The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.

Return type:

GroupBy

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the GroupBy with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

sample(values: groupable, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]

Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility

Parameters:
  • values ((list of) pdarray-like) – The values from which to sample, according to their group membership.

  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the value more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the groupby keys and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

  • return_indices (bool, default False) – if True, return the indices of the sampled values. Otherwise, return the sample values.

  • permute_samples (bool, default False) – if True, return permute the samples according to group Otherwise, keep samples in original order.

Returns:

if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.

Return type:

pdarray

size() Tuple[groupable, pdarray][source]

Count the number of elements in each group, i.e. the number of times each key appears. This counts the total number of rows (including NaN values).

Parameters:

none

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears

See also

count

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
std(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find standard deviation

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
sum(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.

Parameters:
  • values (pdarray) – The values to group and sum

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_sums (pdarray) – One sum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The grouped sum of a boolean pdarray returns integers.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
tail(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the last n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The last n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]

Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Returns:

  • None

  • GroupBy is not currently supported by Parquet

unique(values: groupable)[source]

Return the set of unique values in each group, as a SegArray.

Parameters:

values ((list of) pdarray-like) – The values to unique

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) SegArray) – The unique values of each group

Raises:

TypeError – Raised if values is or contains Strings or Categorical

unregister()[source]

Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

unregister_groupby_by_name(user_defined_name: str) None[source]

Function to unregister GroupBy object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the GroupBy object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)[source]
var(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find variance

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
class arkouda.GroupBy[source]

Group an array or list of arrays by value, usually in preparation for aggregating the within-group values of another array.

Parameters:
  • keys ((list of) pdarray, Strings, or Categorical) – The array to group by value, or if list, the column arrays to group by row

  • assume_sorted (bool) – If True, assume keys is already sorted (Default: False)

nkeys

The number of key arrays (columns)

Type:

int

size[source]

The length of the input array(s), i.e. number of rows

Type:

int

permutation

The permutation that sorts the keys array(s) by value (row)

Type:

pdarray

unique_keys

The unique values of the keys array(s), in grouped order

Type:

(list of) pdarray, Strings, or Categorical

ngroups

The length of the unique_keys array(s), i.e. number of groups

Type:

int

segments

The start index of each group in the grouped array(s)

Type:

pdarray

logger

Used for all logging operations

Type:

ArkoudaLogger

dropna

If True, and the groupby keys contain NaN values, the NaN values together with the corresponding row will be dropped. Otherwise, the rows corresponding to NaN values will be kept.

Type:

bool (default=True)

Raises:

TypeError – Raised if keys is a pdarray with a dtype other than int64

Notes

Integral pdarrays, Strings, and Categoricals are natively supported, but float64 and bool arrays are not.

For a user-defined class to be groupable, it must inherit from pdarray and define or overload the grouping API:

  1. a ._get_grouping_keys() method that returns a list of pdarrays that can be (co)argsorted.

  2. (Optional) a .group() method that returns the permutation that groups the array

If the input is a single array with a .group() method defined, method 2 will be used; otherwise, method 1 will be used.

AND(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise AND of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise AND reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with AND

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise AND of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

OR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise OR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise OR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with OR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise OR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

Reductions(*args, **kwargs)

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

XOR(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Bitwise XOR of values in each segment.

Using the permutation stored in the GroupBy instance, group another array of values and perform a bitwise XOR reduction on each group.

Parameters:

values (pdarray, int64) – The values to group and reduce with XOR

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • result (pdarray, int64) – Bitwise XOR of values in segments corresponding to keys

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not int64

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

aggregate(values: groupable, operator: str, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, groupable][source]

Using the permutation stored in the GroupBy instance, group another array of values and apply a reduction to each group’s values.

Parameters:
  • values (pdarray) – The values to group and reduce

  • operator (str) – The name of the reduction operator to use

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • aggregates (groupable) – One aggregate value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if the requested operator is not supported for the values dtype

Examples

>>> keys = ak.arange(0, 10)
>>> vals = ak.linspace(-1, 1, 10)
>>> g = ak.GroupBy(keys)
>>> g.aggregate(vals, 'sum')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777768,
-0.55555555555555536, -0.33333333333333348, -0.11111111111111116,
0.11111111111111116, 0.33333333333333348, 0.55555555555555536, 0.77777777777777768,
1]))
>>> g.aggregate(vals, 'min')
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([-1, -0.77777777777777779,
-0.55555555555555558, -0.33333333333333337, -0.11111111111111116, 0.11111111111111116,
0.33333333333333326, 0.55555555555555536, 0.77777777777777768, 1]))
all(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “and” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “and”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if all is not supported for the values dtype

any(values: pdarray) Tuple[pdarray | List[pdarray | Strings], pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and perform an “or” reduction on each group.

Parameters:

values (pdarray, bool) – The values to group and reduce with “or”

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_any (pdarray, bool) – One bool per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray or if the pdarray dtype is not bool

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

argmax(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first maximum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmax

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argmaxima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmax(b)
(array([2, 3, 4]), array([9, 3, 2]))
argmin(values: pdarray) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the location of the first minimum of each group’s values.

Parameters:

values (pdarray) – The values to group and find argmin

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_argminima (pdarray, int64) – One index per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if argmax is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if argmin is not supported for the values dtype

Notes

The returned indices refer to the original values array as passed in, not the permutation applied by the GroupBy instance.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.argmin(b)
(array([2, 3, 4]), array([5, 4, 2]))
attach(user_defined_name: str) GroupBy[source]

Function to return a GroupBy object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which GroupBy object was registered under

Returns:

The GroupBy object created by re-attaching to the corresponding server components

Return type:

GroupBy

Raises:

RegistrationError – if user_defined_name is not registered

broadcast(values: pdarray | Strings, permute: bool = True) pdarray | Strings[source]

Fill each group’s segment with a constant value.

Parameters:
  • values (pdarray, Strings) – The values to put in each group’s segment

  • permute (bool) – If True (default), permute broadcast values back to the ordering of the original array on which GroupBy was called. If False, the broadcast values are grouped by value.

Returns:

The broadcasted values

Return type:

pdarray, Strings

Raises:
  • TypeError – Raised if value is not a pdarray object

  • ValueError – Raised if the values array does not have one value per segment

Notes

This function is a sparse analog of np.broadcast. If a GroupBy object represents a sparse matrix (tensor), then this function takes a (dense) column vector and replicates each value to the non-zero elements in the corresponding row.

Examples

>>> a = ak.array([0, 1, 0, 1, 0])
>>> values = ak.array([3, 5])
>>> g = ak.GroupBy(a)
# By default, result is in original order
>>> g.broadcast(values)
array([3, 5, 3, 5, 3])
# With permute=False, result is in grouped order
>>> g.broadcast(values, permute=False)
array([3, 3, 3, 5, 5]
>>> a = ak.randint(1,5,10)
>>> a
array([3, 1, 4, 4, 4, 1, 3, 3, 2, 2])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> g.broadcast(counts > 2)
array([True False True True True False True True False False])
>>> g.broadcast(counts == 3)
array([True False True True True False True True False False])
>>> g.broadcast(counts < 4)
array([True True True True True True True True True True])
build_from_components(user_defined_name: str | None = None, **kwargs) GroupBy[source]

function to build a new GroupBy object from component keys and permutation.

Parameters:
  • user_defined_name (str (Optional) Passing a name will init the new GroupBy) – and assign it the given name

  • kwargs (dict Dictionary of components required for rebuilding the GroupBy.) – Expected keys are “orig_keys”, “permutation”, “unique_keys”, and “segments”

Returns:

The GroupBy object created by using the given components

Return type:

GroupBy

count(values: pdarray) Tuple[groupable, pdarray][source]

Count the number of elements in each group. NaN values will be excluded from the total.

Parameters:

values (pdarray) – The values to be count by group (excluding NaN values).

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears (excluding NaN values).

Examples

>>> a = ak.array([1, 0, -1, 1, 0, -1])
>>> a
array([1 0 -1 1 0 -1])
>>> b = ak.array([1, np.nan, -1, np.nan, np.nan, -1], dtype = "float64")
>>> b
array([1.00000000000000000 nan -1.00000000000000000 nan nan -1.00000000000000000])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.count(b)
>>> keys
array([-1 0 1])
>>> counts
array([2 0 1])
first(values: groupable_element_type) Tuple[groupable, groupable_element_type][source]

First value in each group.

Parameters:

values (pdarray-like) – The values from which to take the first of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first value of each group

from_return_msg(rep_msg)[source]
head(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the first n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The first n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.head(v, 2, return_indices=True)
>>> _, values = g.head(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([0 3 1 4 2 5])
>>> values
array([0 3 1 4 2 5])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.head(v2, 2, return_indices=True)
>>> _, values2 = g.head(v2, 2, return_indices=False)
>>> idx2
array([0 3 1 4 2 5])
>>> values2
array([0 -6 -2 -8 -4 -10])
is_registered() bool[source]

Return True if the object is contained in the registry

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mismatch of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

max(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the maximum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find maxima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_maxima (pdarray) – One maximum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if max is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if max is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.max(b)
(array([2, 3, 4]), array([4, 4, 3]))
mean(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the mean of each group’s values.

Parameters:
  • values (pdarray) – The values to group and average

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_means (pdarray, float64) – One mean value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.mean(b)
(array([2, 3, 4]), array([2.6666666666666665, 2.7999999999999998, 3]))
median(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the median of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find median

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_medians (pdarray, float64) – One median value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,9)
>>> a
array([4 1 4 3 2 2 2 3 3])
>>> g = ak.GroupBy(a)
>>> g.keys
array([4 1 4 3 2 2 2 3 3])
>>> b = ak.linspace(-5,5,9)
>>> b
array([-5 -3.75 -2.5 -1.25 0 1.25 2.5 3.75 5])
>>> g.median(b)
(array([1 2 3 4]), array([-3.75 1.25 3.75 -3.75]))
min(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the minimum of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find minima

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_minima (pdarray) – One minimum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object or if min is not supported for the values dtype

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if min is not supported for the values dtype

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.min(b)
(array([2, 3, 4]), array([1, 1, 3]))
mode(values: groupable) Tuple[groupable, groupable][source]

Most common value in each group. If a group is multi-modal, return the modal value that occurs first.

Parameters:

values ((list of) pdarray-like) – The values from which to take the mode of each group

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) pdarray-like) – The most common value of each group

most_common(values)[source]

(Deprecated) See GroupBy.mode().

nunique(values: groupable) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and return the number of unique values in each group.

Parameters:

values (pdarray, int64) – The values to group and find unique values

Returns:

  • unique_keys (groupable) – The unique keys, in grouped order

  • group_nunique (groupable) – Number of unique values per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the dtype(s) of values array(s) does/do not support the nunique method

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if nunique is not supported for the values dtype

Examples

>>> data = ak.array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> data
array([3, 4, 3, 1, 1, 4, 3, 4, 1, 4])
>>> labels = ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> labels
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g = ak.GroupBy(labels)
>>> g.keys
ak.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4])
>>> g.nunique(data)
array([1,2,3,4]), array([2, 2, 3, 1])
#    Group (1,1,1) has values [3,4,3] -> there are 2 unique values 3&4
#    Group (2,2,2) has values [1,1,4] -> 2 unique values 1&4
#    Group (3,3,3) has values [3,4,1] -> 3 unique values
#    Group (4) has values [4] -> 1 unique value
objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

prod(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the product of each group’s values.

Parameters:
  • values (pdarray) – The values to group and multiply

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_products (pdarray, float64) – One product per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

  • RuntimeError – Raised if prod is not supported for the values dtype

Notes

The return dtype is always float64.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.prod(b)
(array([2, 3, 4]), array([12, 108.00000000000003, 8.9999999999999982]))
register(user_defined_name: str) GroupBy[source]

Register this GroupBy object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the GroupBy is to be registered under, this will be the root name for underlying components

Returns:

The same GroupBy which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different GroupBys with the same name.

Return type:

GroupBy

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the GroupBy with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

sample(values: groupable, n=None, frac=None, replace=False, weights=None, random_state=None, return_indices=False, permute_samples=False)[source]

Return a random sample from each group. You can either specify the number of elements or the fraction of elements to be sampled. random_state can be used for reproducibility

Parameters:
  • values ((list of) pdarray-like) – The values from which to sample, according to their group membership.

  • n (int, optional) – Number of items to return for each group. Cannot be used with frac and must be no larger than the smallest group unless replace is True. Default is one if frac is None.

  • frac (float, optional) – Fraction of items to return. Cannot be used with n.

  • replace (bool, default False) – Allow or disallow sampling of the value more than once.

  • weights (pdarray, optional) – Default None results in equal probability weighting. If passed a pdarray, then values must have the same length as the groupby keys and will be used as sampling probabilities after normalization within each group. Weights must be non-negative with at least one positive element within each group.

  • random_state (int or ak.random.Generator, optional) – If int, seed for random number generator. If ak.random.Generator, use as given.

  • return_indices (bool, default False) – if True, return the indices of the sampled values. Otherwise, return the sample values.

  • permute_samples (bool, default False) – if True, return permute the samples according to group Otherwise, keep samples in original order.

Returns:

if return_indices is True, return the indices of the sampled values. Otherwise, return the sample values.

Return type:

pdarray

size() Tuple[groupable, pdarray][source]

Count the number of elements in each group, i.e. the number of times each key appears. This counts the total number of rows (including NaN values).

Parameters:

none

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • counts (pdarray, int64) – The number of times each unique key appears

See also

count

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 2, 3, 1, 2, 4, 3, 4, 3, 4])
>>> g = ak.GroupBy(a)
>>> keys,counts = g.size()
>>> keys
array([1, 2, 3, 4])
>>> counts
array([1, 2, 4, 3])
std(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the standard deviation of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find standard deviation

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_stds (pdarray, float64) – One std value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.std(b)
(array([2 3 4]), array([1.5275252316519465 1.0954451150103321 0]))
sum(values: pdarray, skipna: bool = True) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and sum each group’s values.

Parameters:
  • values (pdarray) – The values to group and sum

  • skipna (bool) – boolean which determines if NANs should be skipped

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_sums (pdarray) – One sum per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The grouped sum of a boolean pdarray returns integers.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.sum(b)
(array([2, 3, 4]), array([8, 14, 6]))
tail(values: groupable_element_type, n: int = 5, return_indices: bool = True) Tuple[groupable, groupable_element_type][source]

Return the last n values from each group.

Parameters:
  • values ((list of) pdarray-like) – The values from which to select, according to their group membership.

  • n (int, optional, default = 5) – Maximum number of items to return for each group. If the number of values in a group is less than n, all the values from that group will be returned.

  • return_indices (bool, default False) – If True, return the indices of the sampled values. Otherwise, return the selected values.

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result (pdarray-like) – The last n items of each group. If return_indices is True, the result are indices. O.W. the result are values.

Examples

>>> a = ak.arange(10) %3
>>> a
array([0 1 2 0 1 2 0 1 2 0])
>>> v = ak.arange(10)
>>> v
array([0 1 2 3 4 5 6 7 8 9])
>>> g = GroupBy(a)
>>> unique_keys, idx = g.tail(v, 2, return_indices=True)
>>> _, values = g.tail(v, 2, return_indices=False)
>>> unique_keys
array([0 1 2])
>>> idx
array([6 9 4 7 5 8])
>>> values
array([6 9 4 7 5 8])
>>> v2 =  -2 * ak.arange(10)
>>> v2
array([0 -2 -4 -6 -8 -10 -12 -14 -16 -18])
>>> _, idx2 = g.tail(v2, 2, return_indices=True)
>>> _, values2 = g.tail(v2, 2, return_indices=False)
>>> idx2
array([6 9 4 7 5 8])
>>> values2
array([-12 -18 -8 -14 -10 -16])
to_hdf(prefix_path, dataset='groupby', mode='truncate', file_type='distribute')[source]

Save the GroupBy to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Returns:

  • None

  • GroupBy is not currently supported by Parquet

unique(values: groupable)[source]

Return the set of unique values in each group, as a SegArray.

Parameters:

values ((list of) pdarray-like) – The values to unique

Returns:

  • unique_keys ((list of) pdarray-like) – The unique keys, in grouped order

  • result ((list of) SegArray) – The unique values of each group

Raises:

TypeError – Raised if values is or contains Strings or Categorical

unregister()[source]

Unregister this GroupBy object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

unregister_groupby_by_name(user_defined_name: str) None[source]

Function to unregister GroupBy object by name which was registered with the arkouda server via register()

Parameters:

user_defined_name (str) – Name under which the GroupBy object was registered

Raises:
  • TypeError – if user_defined_name is not a string

  • RegistrationError – if there is an issue attempting to unregister any underlying components

update_hdf(prefix_path: str, dataset: str = 'groupby', repack: bool = True)[source]
var(values: pdarray, skipna: bool = True, ddof: int_scalars = 1) Tuple[groupable, pdarray][source]

Using the permutation stored in the GroupBy instance, group another array of values and compute the variance of each group’s values.

Parameters:
  • values (pdarray) – The values to group and find variance

  • skipna (bool) – boolean which determines if NANs should be skipped

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

  • unique_keys ((list of) pdarray or Strings) – The unique keys, in grouped order

  • group_vars (pdarray, float64) – One var value per unique key in the GroupBy instance

Raises:
  • TypeError – Raised if the values array is not a pdarray object

  • ValueError – Raised if the key array size does not match the values size or if the operator is not in the GroupBy.Reductions array

Notes

The return dtype is always float64.

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

Examples

>>> a = ak.randint(1,5,10)
>>> a
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> g = ak.GroupBy(a)
>>> g.keys
array([3, 3, 4, 3, 3, 2, 3, 2, 4, 2])
>>> b = ak.randint(1,5,10)
>>> b
array([3, 3, 3, 4, 1, 1, 3, 3, 3, 4])
>>> g.var(b)
(array([2 3 4]), array([2.333333333333333 1.2 0]))
class arkouda.IPv4(values)[source]

Bases: arkouda.pdarrayclass.pdarray

Represent integers as IPv4 addresses.

Parameters:

values (pdarray, int64) – The integer IP addresses

Returns:

The same IP addresses

Return type:

IPv4

Notes

This class is a thin wrapper around pdarray that mostly affects how values are displayed to the user. Operators and methods will typically treat this class like an int64 pdarray.

export_uint()[source]
format(x)[source]

Format a single integer IP address as a string.

normalize(x)[source]

Take in an IP address as a string, integer, or IPAddress object, and convert it to an integer.

opeq(other, op)[source]
register(user_defined_name)[source]

Register this IPv4 object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the IPv4 is to be registered under, this will be the root name for underlying components

Returns:

The same IPv4 which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different IPv4s with the same name.

Return type:

IPv4

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the IPv4 with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

special_objType = 'IPv4'
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute')[source]

Override of the pdarray to_hdf to store the special object type

to_list()[source]

Export array as a list of integers.

to_ndarray()[source]

Export array as a numpy array of integers.

update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Override the pdarray implementation so that the special object type will be used.

values
class arkouda.Index(values: List | arkouda.pdarrayclass.pdarray | arkouda.Strings | arkouda.Categorical | pandas.Index | Index | pandas.Categorical, name: str | None = None, allow_list=False, max_list_size=1000)[source]
argsort(ascending=True)[source]
concat(other)[source]
equals(other: Index) arkouda.numpy.dtypes.bool_scalars[source]

Whether Indexes are the same size, and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Indexes are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> i = ak.Index([1, 2, 3])
>>> i_cpy = ak.Index([1, 2, 3])
>>> i.equals(i_cpy)
True
>>> i2 = ak.Index([1, 2, 4])
>>> i.equals(i2)
False

MultiIndex case:

>>> arrays = [ak.array([1, 1, 2, 2]), ak.array(["red", "blue", "red", "blue"])]
>>> m = ak.MultiIndex(arrays, names=["numbers2", "colors2"])
>>> m.equals(m)
True
>>> arrays2 = [ak.array([1, 1, 2, 2]), ak.array(["red", "blue", "red", "green"])]
>>> m2 = ak.MultiIndex(arrays2, names=["numbers2", "colors2"])
>>> m.equals(m2)
False
static factory(index)[source]
classmethod from_return_msg(rep_msg)[source]
property index

This is maintained to support older code

property inferred_type: str

Return a string of the type inferred from the values.

is_registered()[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property is_unique

Property indicating if all values in the index are unique :rtype: bool - True if all values are unique, False otherwise.

lookup(key)[source]
map(arg: dict | arkouda.series.Series) Index[source]

Map values of Index according to an input mapping.

Parameters:

arg (dict or Series) – The mapping correspondence.

Returns:

A new index with the values transformed by the mapping correspondence.

Return type:

arkouda.index.Index

Raises:

TypeError – Raised if arg is not of type dict or arkouda.Series. Raised if index values not of type pdarray, Categorical, or Strings.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> idx = ak.Index(ak.array([2, 3, 2, 3, 4]))
>>> display(idx)
Index(array([2 3 2 3 4]), dtype='int64')
>>> idx.map({4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0})
Index(array([30.00000000000000000 5.00000000000000000 30.00000000000000000
5.00000000000000000 25.00000000000000000]), dtype='float64')
>>> s2 = ak.Series(ak.array(["a","b","c","d"]), index = ak.array([4,2,1,3]))
>>> idx.map(s2)
Index(array(['b', 'b', 'd', 'd', 'a']), dtype='<U0')
max_list_size = 1000
memory_usage(unit='B')[source]

Return the memory usage of the Index values.

Parameters:

unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Returns:

Bytes of memory consumed.

Return type:

int

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> idx = Index(ak.array([1, 2, 3]))
>>> idx.memory_usage()
24
property names

Return Index or MultiIndex names.

property ndim

Number of dimensions of the underlying data, by definition 1.

See also

MultiIndex.ndim

property nlevels

Integer number of levels in this Index. An Index will always have 1 level. .. seealso:: MultiIndex.nlevels

objType = 'Index'

Sequence used for indexing and alignment.

The basic object storing axis labels for all DataFrame objects.

Parameters:
  • values (List, pdarray, Strings, Categorical, pandas.Categorical, pandas.Index, or Index)

  • name (str, default=None) – Name to be stored in the index.

  • False (allow_list =) – If False, list values will be converted to a pdarray. If True, list values will remain as a list, provided the data length is less than max_list_size.

:paramIf False, list values will be converted to a pdarray.

If True, list values will remain as a list, provided the data length is less than max_list_size.

Parameters:

1000 (max_list_size =) – This is the maximum allowed data length for the values to be stored as a list object.

Raises:

ValueError – Raised if allow_list=True and the size of values is > max_list_size.

See also

MultiIndex

Examples

>>> ak.Index([1, 2, 3])
Index(array([1 2 3]), dtype='int64')
>>> ak.Index(list('abc'))
Index(array(['a', 'b', 'c']), dtype='<U0')
>>> ak.Index([1, 2, 3], allow_list=True)
Index([1, 2, 3], dtype='int64')
register(user_defined_name)[source]

Register this Index object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Index is to be registered under, this will be the root name for underlying components

Returns:

The same Index which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Indexes with the same name.

Return type:

Index

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Index with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
save(prefix_path: str, dataset: str = 'index', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the index to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:
  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append

  • TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string. Raised if the Index values are a list.

Notes

The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used. The file I/O does not rely on the extension to determine the file format.

set_dtype(dtype)[source]

Change the data type of the index

Currently only aku.ip_address and ak.array are supported.

property shape
to_csv(prefix_path: str, dataset: str = 'index', col_delim: str = ',', overwrite: bool = False)[source]

Write Index to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

prefix_path: str

The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

dataset: str

Column name to save the pdarray under. Defaults to “array”.

col_delim: str

Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

overwrite: bool

Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

str reponse message

ValueError

Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist.

RuntimeError

Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

TypeError

Raised if we receive an unknown arkouda_type returned from the server. Raised if the Index values are a list.

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (`

`) at this time.

to_dict(label)[source]
to_hdf(prefix_path: str, dataset: str = 'index', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the Index to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • TypeError – Raised if the Index values are a list.

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

to_list()[source]
to_ndarray()[source]
to_pandas()[source]

Return the equivalent Pandas Index.

to_parquet(prefix_path: str, dataset: str = 'index', mode: str = 'truncate', compression: str | None = None)[source]

Save the Index to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • TypeError – Raised if the Index values are a list.

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

unregister()[source]

Unregister this Index object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

update_hdf(prefix_path: str, dataset: str = 'index', repack: bool = True)[source]

Overwrite the dataset with the name provided with this Index object. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the index

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

arkouda.Inf: float
arkouda.Infinity: float
class arkouda.Int16DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.Int32DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.Int64DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.Int8DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.IntDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

arkouda.LEN_SUFFIX = '_lengths'
class arkouda.LogLevel[source]

Bases: enum.Enum

Generic enumeration.

Derive from this class to define new enumerations.

CRITICAL = 'CRITICAL'
DEBUG = 'DEBUG'
ERROR = 'ERROR'
INFO = 'INFO'
WARN = 'WARN'
class arkouda.LongDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.LongDoubleDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.LongLongDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.MultiIndex(data: list | tuple | pandas.MultiIndex | MultiIndex, name: str | None = None, names: list[str] | None = None)[source]

Bases: Index

argsort(ascending=True)[source]
concat(other)[source]
property dtype: numpy.dtype

Return the dtype object of the underlying data.

equal_levels(other: MultiIndex) bool[source]

Return True if the levels of both MultiIndex objects are the same

get_level_values(level: str | int)[source]
property index

This is maintained to support older code

property inferred_type: str

Return a string of the type inferred from the values.

is_registered()[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

levels: list
lookup(key)[source]
memory_usage(unit='B')[source]

Return the memory usage of the MultiIndex levels.

Parameters:

unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Returns:

Bytes of memory consumed.

Return type:

int

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> m = ak.index.MultiIndex([ak.array([1,2,3]),ak.array([4,5,6])])
>>> m.memory_usage()
48
property name

Return Index or MultiIndex name.

property names

Return Index or MultiIndex names.

property ndim

Number of dimensions of the underlying data, by definition 1.

See also

Index.ndim

property nlevels: int

Integer number of levels in this MultiIndex.

See also

Index.nlevels

objType = 'MultiIndex'

Sequence used for indexing and alignment.

The basic object storing axis labels for all DataFrame objects.

Parameters:
  • values (List, pdarray, Strings, Categorical, pandas.Categorical, pandas.Index, or Index)

  • name (str, default=None) – Name to be stored in the index.

  • False (allow_list =) – If False, list values will be converted to a pdarray. If True, list values will remain as a list, provided the data length is less than max_list_size.

:paramIf False, list values will be converted to a pdarray.

If True, list values will remain as a list, provided the data length is less than max_list_size.

Parameters:

1000 (max_list_size =) – This is the maximum allowed data length for the values to be stored as a list object.

Raises:

ValueError – Raised if allow_list=True and the size of values is > max_list_size.

See also

MultiIndex

Examples

>>> ak.Index([1, 2, 3])
Index(array([1 2 3]), dtype='int64')
>>> ak.Index(list('abc'))
Index(array(['a', 'b', 'c']), dtype='<U0')
>>> ak.Index([1, 2, 3], allow_list=True)
Index([1, 2, 3], dtype='int64')
register(user_defined_name)[source]

Register this Index object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Index is to be registered under, this will be the root name for underlying components

Returns:

The same Index which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Indexes with the same name.

Return type:

MultiIndex

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Index with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
set_dtype(dtype)[source]

Change the data type of the index

Currently only aku.ip_address and ak.array are supported.

to_dict(labels=None)[source]
to_hdf(prefix_path: str, dataset: str = 'index', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the Index to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray.

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

to_list()[source]
to_ndarray()[source]
to_pandas()[source]

Return the equivalent Pandas Index.

unregister()[source]

Unregister this Index object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

update_hdf(prefix_path: str, dataset: str = 'index', repack: bool = True)[source]

Overwrite the dataset with the name provided with this Index object. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the index

  • TypeError – Raised if the Index levels are a list.

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

arkouda.NAN: float
arkouda.NINF: float
class arkouda.NUMBER_FORMAT_STRINGS

dict() -> new empty dictionary dict(mapping) -> new dictionary initialized from a mapping object’s

(key, value) pairs

dict(iterable) -> new dictionary initialized as if via:

d = {} for k, v in iterable:

d[k] = v

dict(**kwargs) -> new dictionary initialized with the name=value pairs

in the keyword argument list. For example: dict(one=1, two=2)

clear(*args, **kwargs)

D.clear() -> None. Remove all items from D.

copy(*args, **kwargs)

D.copy() -> a shallow copy of D

fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items(*args, **kwargs)

D.items() -> a set-like object providing a view on D’s items

keys(*args, **kwargs)

D.keys() -> a set-like object providing a view on D’s keys

pop(*args, **kwargs)

D.pop(k[,d]) -> v, remove specified key and return the corresponding value.

If key is not found, default is returned if given, otherwise KeyError is raised

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update(*args, **kwargs)

D.update([E, ]**F) -> None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values(*args, **kwargs)

D.values() -> an object providing a view on D’s values

arkouda.NZERO: float
arkouda.NaN: float
exception arkouda.NonUniqueError[source]

Bases: ValueError

Inappropriate argument value (of correct type).

class arkouda.NumericDTypes

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.ObjectDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

arkouda.PINF: float
arkouda.PZERO: float
class arkouda.Power_divergenceResult[source]

Bases: Power_divergenceResult

The results of a power divergence statistical test.

statistic
Type:

numpy.float64

pvalue
Type:

numpy.float64

class arkouda.Properties[source]
class arkouda.RankWarning

Issued by polyfit when the Vandermonde matrix is rank deficient.

For more information, a way to suppress the warning, and an example of RankWarning being issued, see polyfit.

arkouda.RegisteredSymbols = '__RegisteredSymbols__'
exception arkouda.RegistrationError[source]

Bases: Exception

Error/Exception used when the Arkouda Server cannot register an object

exception arkouda.RegistrationError[source]

Bases: Exception

Error/Exception used when the Arkouda Server cannot register an object

exception arkouda.RegistrationError[source]

Bases: Exception

Error/Exception used when the Arkouda Server cannot register an object

exception arkouda.RegistrationError[source]

Bases: Exception

Error/Exception used when the Arkouda Server cannot register an object

exception arkouda.RegistrationError[source]

Bases: Exception

Error/Exception used when the Arkouda Server cannot register an object

class arkouda.Row(dict=None, /, **kwargs)[source]

Bases: collections.UserDict

This class is useful for printing and working with individual rows of a of an aku.DataFrame.

arkouda.SEG_SUFFIX = '_segments'
class arkouda.ScalarDTypes

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.ScalarType

Built-in immutable sequence.

If no argument is given, the constructor returns an empty tuple. If iterable is specified the tuple is initialized from iterable’s items.

If the argument is a tuple, the return value is the same object.

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

class arkouda.SegArray(segments, values, lengths=None, grouping=None)[source]
AND(x=None)[source]
OR(x=None)[source]
XOR(x=None)[source]
aggregate(op, x=None)[source]
all(x=None)[source]
any(x=None)[source]
append(other, axis=0)[source]

Append other to self, either vertically (axis=0, length of resulting SegArray increases), or horizontally (axis=1, each sub-array of other appends to the corresponding sub-array of self).

Parameters:
  • other (SegArray) – Array of sub-arrays to append

  • axis (0 or 1) – Whether to append vertically (0) or horizontally (1). If axis=1, other must be same size as self.

Returns:

axis=0: New SegArray containing all sub-arrays axis=1: New SegArray of same length, with pairs of sub-arrays concatenated

Return type:

SegArray

append_single(x, prepend=False)[source]

Append a single value to each sub-array.

Parameters:

x (pdarray or scalar) – Single value to append to each sub-array

Returns:

Copy of original SegArray with values from x appended to each sub-array

Return type:

SegArray

argmax(x=None)[source]
argmin(x=None)[source]
classmethod attach(user_defined_name)[source]

Using the defined name, attach to a SegArray that has been registered to the Symbol Table

Parameters:

user_defined_name (str) – user defined name which the SegArray object was registered under

Returns:

The resulting SegArray

Return type:

SegArray

Raises:

RuntimeError – Raised if the server could not attach to the SegArray object

classmethod concat(x, axis=0, ordered=True)[source]

Concatenate a sequence of SegArrays

Parameters:
  • x (sequence of SegArray) – The SegArrays to concatenate

  • axis (0 or 1) – Select vertical (0) or horizontal (1) concatenation. If axis=1, all SegArrays must have same size.

  • ordered (bool) – Must be True. This option is present for compatibility only, because unordered concatenation is not yet supported.

Returns:

The input arrays joined into one SegArray

Return type:

SegArray

copy()[source]

Return a deep copy.

dtype
filter(filter, discard_empty: bool = False)[source]

Filter values out of the SegArray object

Parameters:
  • filter (pdarray, list, or value) – The value/s to be filtered out of the SegArray

  • discard_empty (bool) – Defaults to False. When True, empty segments are removed from the return SegArray

Return type:

SegArray

classmethod from_multi_array(m)[source]

Construct a SegArray from a list of columns. This essentially transposes the input, resulting in an array of rows.

Parameters:

m (list of pdarray or Strings) – List of columns, the rows of which will form the sub-arrays of the output

Returns:

Array of rows of input

Return type:

SegArray

classmethod from_parts(segments, values, lengths=None, grouping=None) SegArray[source]

DEPRECATED Construct a SegArray object from its parts

Parameters:
  • segments (pdarray, int64) – Start index of each sub-array in the flattened values array

  • values (pdarray) – The flattened values of all sub-arrays

  • lengths (pdarray) – The length of each segment

  • grouping (GroupBy) – grouping of segments

Returns:

Data structure representing an array whose elements are variable-length arrays.

Return type:

SegArray

Notes

Keyword args ‘lengths’ and ‘grouping’ are not user-facing. They are used by the attach method.

classmethod from_return_msg(rep_msg) SegArray[source]
get_jth(j, return_origins=True, compressed=False, default=0)[source]

Select the j-th element of each sub-array, where possible.

Parameters:
  • j (int) – The index of the value to get from each sub-array. If j is negative, it counts backwards from the end of each sub-array.

  • return_origins (bool) – If True, return a logical index indicating where j is in bounds

  • compressed (bool) – If False, return array is same size as self, with default value where j is out of bounds. If True, the return array only contains values where j is in bounds.

  • default (scalar) – When compressed=False, the value to return when j is out of bounds for the sub-array

Returns:

  • val (pdarray) – compressed=False: The j-th value of each sub-array where j is in bounds and the default value where j is out of bounds. compressed=True: The j-th values of only the sub-arrays where j is in bounds

  • origin_indices (pdarray, bool) – A Boolean array that is True where j is in bounds for the sub-array.

Notes

If values are Strings, only the compressed format is supported.

get_length_n(n, return_origins=True)[source]

Return all sub-arrays of length n, as a list of columns.

Parameters:
  • n (int) – Length of sub-arrays to select

  • return_origins (bool) – Return a logical index indicating which sub-arrays are length n

Returns:

  • columns (list of pdarray) – An n-long list of pdarray, where each row is one of the n-long sub-arrays from the SegArray. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Array of bool for each element of the SegArray, True where sub-array has length n.

get_ngrams(n, return_origins=True)[source]

Return all n-grams from all sub-arrays.

Parameters:
  • n (int) – Length of n-gram

  • return_origins (bool) – If True, return an int64 array indicating which sub-array each returned n-gram came from.

Returns:

  • ngrams (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-gram.

  • origin_indices (pdarray, int) – The index of the sub-array from which the corresponding n-gram originated

get_prefixes(n, return_origins=True, proper=True)[source]

Return all sub-array prefixes of length n (for sub-arrays that are at least n+1 long)

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-prefix

  • proper (bool) – If True, only return proper prefixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a prefix.

Returns:

  • prefixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-prefix. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-suffix, False otherwise.

get_suffixes(n, return_origins=True, proper=True)[source]

Return the n-long suffix of each sub-array, where possible

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which sub-arrays were long enough to return an n-suffix

  • proper (bool) – If True, only return proper suffixes, i.e. from sub-arrays that are at least n+1 long. If False, allow the entire sub-array to be returned as a suffix.

Returns:

  • suffixes (list of pdarray) – An n-long list of pdarrays, essentially a table where each row is an n-suffix. The number of rows is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the sub-array was long enough to return an n-suffix, False otherwise.

property grouping
hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each segment.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

intersect(other)[source]

Computes the intersection of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d intersections of the segments of self and other

Return type:

SegArray

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.intersect(seg_b)
SegArray([
[1, 3],
[4]
])
is_registered() bool[source]

Checks if the name of the SegArray object is registered in the Symbol Table

Returns:

True if SegArray is registered, false if not

Return type:

bool

classmethod load(prefix_path, dataset='segarray', segment_name='segments', value_name='values')[source]
logger
max(x=None)[source]
mean(x=None)[source]
min(x=None)[source]
property nbytes

The size of the segarray in bytes.

Returns:

The size of the segarray in bytes.

Return type:

int

property non_empty
nunique(x=None)[source]
objType = 'SegArray'
prepend_single(x)[source]
prod(x=None)[source]
classmethod read_hdf(prefix_path, dataset='segarray')[source]

Load a saved SegArray from HDF5. All arguments must match what was supplied to SegArray.save()

Parameters:
  • prefix_path (str) – Directory and filename prefix

  • dataset (str) – Name prefix for saved data within the HDF5 files

Return type:

SegArray

register(user_defined_name)[source]

Register this SegArray object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name which this SegArray object will be registered under

Returns:

The same SegArray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different SegArrays with the same name.

Return type:

SegArray

Raises:

RegistrationError – Raised if the server could not register the SegArray object

Notes

Objects registered with the server are immune to deletion until they are unregistered.

registered_name: str | None = None
remove_repeats(return_multiplicity=False)[source]

Condense sequences of repeated values within a sub-array to a single value.

Parameters:

return_multiplicity (bool) – If True, also return the number of times each value was repeated.

Returns:

  • norepeats (SegArray) – Sub-arrays with runs of repeated values replaced with single value

  • multiplicity (SegArray) – If return_multiplicity=True, this array contains the number of times each value in the returned SegArray was repeated in the original SegArray.

save(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')[source]

DEPRECATED Save the SegArray to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf, load

segments
set_jth(i, j, v)[source]

Set the j-th element of each sub-array in a subset.

Parameters:
  • i (pdarray, int) – Indices of sub-arrays to set j-th element

  • j (int) – Index of value to set in each sub-array. If j is negative, it counts backwards from the end of the sub-array.

  • v (pdarray or scalar) – The value(s) to set. If v is a pdarray, it must have same length as i.

Raises:

ValueError – If j is out of bounds in any of the sub-arrays specified by i.

setdiff(other)[source]

Computes the set difference of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d set difference of the segments of self and other

Return type:

SegArray

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setdiff(seg_b)
SegArray([
[2, 4],
[1, 3, 5]
])
setxor(other)[source]

Computes the symmetric difference of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d symmetric difference of the segments of self and other

Return type:

SegArray

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.setxor(seg_b)
SegArray([
[2, 4, 5],
[1, 3, 5, 2]
])
size
sum(x=None)[source]
to_hdf(prefix_path, dataset='segarray', mode='truncate', file_type='distribute')[source]

Save the SegArray to HDF5. The result is a collection of HDF5 files, one file per locale of the arkouda server, where each filename starts with prefix_path.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files will share

  • dataset (str) – Name prefix for saved data within the HDF5 file

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, add data as a new column to existing files.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

None

See also

load

to_list()[source]

Convert the segarray into a list containing sub-arrays

Returns:

A list with the same sub-arrays (also list) as this segarray

Return type:

list

See also

to_ndarray

Examples

>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_list()
[[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11]]
>>> type(segarr.to_list())
list
to_ndarray()[source]

Convert the array into a numpy.ndarray containing sub-arrays

Returns:

A numpy ndarray with the same sub-arrays (also numpy.ndarray) as this array

Return type:

np.ndarray

See also

array, to_list

Examples

>>> segarr = ak.SegArray(ak.array([0, 4, 7]), ak.arange(12))
>>> segarr.to_ndarray()
array([array([1, 2, 3, 4]), array([5, 6, 7]), array([8, 9, 10, 11, 12])])
>>> type(segarr.to_ndarray())
numpy.ndarray
to_parquet(prefix_path, dataset='segarray', mode: str = 'truncate', compression: str | None = None)[source]

Save the SegArray object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the object to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: Deprecated.

Parameter kept to maintain functionality of other calls. Only Truncate supported. By default, truncate (overwrite) output files, if they exist. If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – If write mode is not Truncate.

Notes

  • Append mode for Parquet has been deprecated. It was not implemented for SegArray.

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Segmented Array to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Segmented Array is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

union(other)[source]

Computes the union of 2 SegArrays.

Parameters:

other (SegArray) – SegArray to compute against

Returns:

Segments are the 1d union of the segments of self and other

Return type:

SegArray

Examples

>>> a = [1, 2, 3, 1, 4]
>>> b = [3, 1, 4, 5]
>>> c = [1, 3, 3, 5]
>>> d = [2, 2, 4]
>>> seg_a = ak.segarray(ak.array([0, len(a)]), ak.array(a+b))
>>> seg_b = ak.segarray(ak.array([0, len(c)]), ak.array(c+d))
>>> seg_a.union(seg_b)
SegArray([
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5]
])
unique(x=None)[source]

Return sub-arrays of unique values.

Parameters:

x (pdarray) – The values to unique, per group. By default, the values of this SegArray’s sub-arrays.

Returns:

Same number of sub-arrays as original SegArray, but elements in sub-array are unique and in sorted order.

Return type:

SegArray

unregister()[source]

Unregister this SegArray object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table

Notes

Objects registered with the server are immune to deletion until they are unregistered.

static unregister_segarray_by_name(user_defined_name)[source]

Using the defined name, remove the registered SegArray object from the Symbol Table

Parameters:

user_defined_name (str) – user defined name which the SegArray object was registered under

Return type:

None

Raises:

RuntimeError – Raised if the server could not unregister the SegArray object from the Symbol Table

update_hdf(prefix_path: str, dataset: str = 'segarray', repack: bool = True)[source]

Overwrite the dataset with the name provided with this SegArray object. If the dataset does not exist it is added.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown saving the SegArray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

valsize
values
class arkouda.Series[source]

One-dimensional arkouda array with axis labels.

Parameters:
  • index (pdarray, Strings) – an array of indices associated with the data array. If empty, it will default to a range of ints whose size match the size of the data. optional

  • data (Tuple, List, groupable_element_type, Series, SegArray) – a 1D array. Must not be None.

Raises:
  • TypeError – Raised if index is not a pdarray or Strings object Raised if data is not a pdarray, Strings, or Categorical object

  • ValueError – Raised if the index size does not match data size

Notes

The Series class accepts either positional arguments or keyword arguments. If entering positional arguments,

2 arguments entered:

argument 1 - data argument 2 - index

1 argument entered:

argument 1 - data

If entering 1 positional argument, it is assumed that this is the data argument. If only ‘data’ argument is passed in, Index will automatically be generated. If entering keywords,

‘data’ (see Parameters) ‘index’ (optional) must match size of ‘data’

add(b: Series) Series[source]
argmax()
argmin()
property at

Accesses entries of a Series by label

Parameters:

key (pdarray, Strings, Series, list, supported_scalars) – The key or container of keys to access entries for

attach(label: str, nkeys: int = 1) Series[source]

DEPRECATED Retrieve a series registered with arkouda

Parameters:
  • label (name used to register the series)

  • nkeys (number of keys, if a multi-index was registerd)

concat(arrays: List, axis: int = 0, index_labels: List[str] | None = None, value_labels: List[str] | None = None, ordered=False) arkouda.dataframe.DataFrame | Series[source]
Concatenate in arkouda a list of arkouda Series or grouped arkouda arrays horizontally or

vertically. If a list of grouped arkouda arrays is passed they are converted to a series. Each grouping is a 2-tuple with the first item being the key(s) and the second being the value. If horizontal, each series or grouping must have the same length and the same index. The index of the series is converted to a column in the dataframe. If it is a multi-index,each level is converted to a column.

arrays: The list of series/groupings to concat. axis : Whether or not to do a verticle (axis=0) or horizontal (axis=1) concatenation index_labels: column names(s) to label the index. value_labels: column names to label values of each series. ordered: If True (default), the arrays will be appended in the order given. If False, array

data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.

axis=0: an arkouda series. axis=1: an arkouda dataframe.

diff() Series[source]

Diffs consecutive values of the series.

Returns a new series with the same index and length. First value is set to NaN.

dt(series)
property dtype
fillna(value) Series[source]

Fill NA/NaN values using the specified method.

Parameters:

value (scalar, Series, or pdarray) – Value to use to fill holes (e.g. 0), alternately a Series of values specifying which value to use for each index. Values not in the Series will not be filled. This value cannot be a list.

Returns:

Object with missing values filled.

Return type:

Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import Series
>>> data = ak.Series([1, np.nan, 3, np.nan, 5])
>>> data

0

0

1

1

nan

2

3

3

nan

4

5

>>> fill_values1 = ak.ones(5)
>>> data.fillna(fill_values1)

0

0

1

1

1

2

3

3

1

4

5

>>> fill_values2 = Series(ak.ones(5))
>>> data.fillna(fill_values2)

0

0

1

1

1

2

3

3

1

4

5

>>> fill_values3 = 100.0
>>> data.fillna(fill_values3)

0

0

1

1

100

2

3

3

100

4

5

from_return_msg(repMsg: str) Series[source]

Return a Series instance pointing to components created by the arkouda server. The user should not call this function directly.

Parameters:

repMsg (str) –

  • delimited string containing the values and indexes

Returns:

A Series representing a set of pdarray components on the server

Return type:

Series

Raises:

RuntimeError – Raised if a server-side error is thrown in the process of creating the Series instance

has_repeat_labels() bool[source]

Returns whether the Series has any labels that appear more than once

hasnans() bool_scalars[source]

Return True if there are any NaNs.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import Series
>>> import numpy as np
>>> s = ak.Series(ak.array([1, 2, 3, np.nan]))
>>> s
>>> s.hasnans
True
head(n: int = 10) Series[source]

Return the first n values of the series

property iat: Series

Accesses entries of a Series by position

Parameters:

key (int) – The positions or container of positions to access entries for

property iloc: Series

Accesses entries of a Series by position

Parameters:

key (int) – The positions or container of positions to access entries for

is_registered() bool[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

isin(lst: pdarray | Strings | List) Series[source]

Find series elements whose values are in the specified list

Either a python list or an arkouda array.

Arkouda boolean which is true for elements that are in the list and false otherwise.

isna() Series[source]

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as numpy.NaN, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings ‘’ are not considered NA values.

Returns:

Mask of bool values for each element in Series that indicates whether an element is an NA value.

Return type:

arkouda.series.Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.isna()

0

1

False

2

False

4

True

isnull() Series[source]

Series.isnull is an alias for Series.isna.

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as numpy.NaN, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings ‘’ are not considered NA values.

Returns:

Mask of bool values for each element in Series that indicates whether an element is an NA value.

Return type:

arkouda.series.Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.isnull()

0

1

False

2

False

4

True

property loc: Series

Accesses entries of a Series by label

Parameters:

key (pdarray, Strings, Series, list, supported_scalars) – The key or container of keys to access entries for

locate(key: int | pdarray | Index | Series | List | Tuple) Series[source]

Lookup values by index label

The input can be a scalar, a list of scalers, or a list of lists (if the series has a MultiIndex). As a special case, if a Series is used as the key, the series labels are preserved with its values use as the key.

Keys will be turned into arkouda arrays as needed.

A Series containing the values corresponding to the key.

map(arg: dict | Series) Series[source]

Map values of Series according to an input mapping.

Parameters:

arg (dict or Series) – The mapping correspondence.

Returns:

A new series with the same index as the caller. When the input Series has Categorical values, the return Series will have Strings values. Otherwise, the return type will match the input type.

Return type:

arkouda.series.Series

Raises:

TypeError – Raised if arg is not of type dict or arkouda.Series. Raised if series values not of type pdarray, Categorical, or Strings.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.Series(ak.array([2, 3, 2, 3, 4]))
>>> display(s)

0

0

2

1

3

2

2

3

3

4

4

>>> s.map({4: 25.0, 2: 30.0, 1: 7.0, 3: 5.0})

0

0

30.0

1

5.0

2

30.0

3

5.0

4

25.0

>>> s2 = ak.Series(ak.array(["a","b","c","d"]), index = ak.array([4,2,1,3]))
>>> s.map(s2)

0

0

b

1

b

2

d

3

d

4

a

max()
mean()
memory_usage(index: bool = True, unit='B') int[source]

Return the memory usage of the Series.

The memory usage can optionally include the contribution of the index.

Parameters:
  • index (bool, default True) – Specifies whether to include the memory usage of the Series index.

  • unit (str, default = "B") – Unit to return. One of {‘B’, ‘KB’, ‘MB’, ‘GB’}.

Returns:

Bytes of memory consumed.

Return type:

int

Examples

>>> from arkouda.series import Series
>>> s = ak.Series(ak.arange(3))
>>> s.memory_usage()
48

Not including the index gives the size of the rest of the data, which is necessarily smaller:

>>> s.memory_usage(index=False)
24

Select the units:

>>> s = ak.Series(ak.arange(3000))
>>> s.memory_usage(unit="KB")
46.875
min()
property ndim
notna() Series[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings ‘’ are not considered NA values. NA values, such as numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in Series that indicates whether an element is not an NA value.

Return type:

arkouda.series.Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.notna()

0

1

True

2

True

4

False

notnull() Series[source]

Series.notnull is an alias for Series.notna.

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings ‘’ are not considered NA values. NA values, such as numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in Series that indicates whether an element is not an NA value.

Return type:

arkouda.series.Series

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import Series
>>> import numpy as np
>>> s = Series(ak.array([1, 2, np.nan]), index = ak.array([1, 2, 4]))
>>> s.notnull()

0

1

True

2

True

4

False

objType(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

pdconcat(arrays: List, axis: int = 0, labels: Strings | None = None) pd.Series | pd.DataFrame[source]

Concatenate a list of arkouda Series or grouped arkouda arrays, returning a PANDAS object.

If a list of grouped arkouda arrays is passed they are converted to a series. Each grouping is a 2-tuple with the first item being the key(s) and the second being the value.

If horizontal, each series or grouping must have the same length and the same index. The index of the series is converted to a column in the dataframe. If it is a multi-index,each level is converted to a column.

arrays: The list of series/groupings to concat. axis : Whether or not to do a verticle (axis=0) or horizontal (axis=1) concatenation labels: names to give the columns of the data frame.

axis=0: a local PANDAS series axis=1: a local PANDAS dataframe

prod()
register(user_defined_name: str)[source]

Register this Series object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the Series is to be registered under, this will be the root name for underlying components

Returns:

The same Series which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Series with the same name.

Return type:

Series

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Series with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property shape
sort_index(ascending: bool = True) Series[source]

Sort the series by its index

ascendingbool

Sort values in ascending (default) or descending order.

A new Series sorted.

sort_values(ascending: bool = True) Series[source]

Sort the series numerically

ascendingbool

Sort values in ascending (default) or descending order.

A new Series sorted smallest to largest

std()
str_acc(series)
sum()
tail(n: int = 10) Series[source]

Return the last n values of the series

to_dataframe(index_labels: List[str] | None = None, value_label: str | None = None) arkouda.dataframe.DataFrame[source]

Converts series to an arkouda data frame

index_labels: column names(s) to label the index. value_label: column name to label values.

An arkouda dataframe.

to_list() list[source]
to_markdown(mode='wt', index=True, tablefmt='grid', storage_options=None, **kwargs)[source]

Print Series in Markdown-friendly format.

Parameters:
  • mode (str, optional) – Mode in which file is opened, “wt” by default.

  • index (bool, optional, default True) – Add index (row) labels.

  • tablefmt (str = "grid") – Table format to call from tablulate: https://pypi.org/project/tabulate/

  • storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc., if using a URL that will be parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set of allowed keys and values.

  • **kwargs – These parameters will be passed to tabulate.

Note

This function should only be called on small Series as it calls pandas.Series.to_markdown: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_markdown.html

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.Series(["elk", "pig", "dog", "quetzal"], name="animal")
>>> print(s.to_markdown())
|    | animal   |
|---:|:---------|
|  0 | elk      |
|  1 | pig      |
|  2 | dog      |
|  3 | quetzal  |

Output markdown with a tabulate option.

>>> print(s.to_markdown(tablefmt="grid"))
+----+----------+
|    | animal   |
+====+==========+
|  0 | elk      |
+----+----------+
|  1 | pig      |
+----+----------+
|  2 | dog      |
+----+----------+
|  3 | quetzal  |
+----+----------+
to_ndarray()[source]
to_pandas() pd.Series[source]

Convert the series to a local PANDAS series

topn(n: int = 10) Series[source]

Return the top values of the series

n: Number of values to return

A new Series with the top values

unregister()[source]

Unregister this Series object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

validate_key(key: Series | pdarray | Strings | Categorical | List | supported_scalars | SegArray) pdarray | Strings | Categorical | supported_scalars | SegArray[source]

Validates type requirements for keys when reading or writing the Series. Also converts list and tuple arguments into pdarrays.

Parameters:

key (Series, pdarray, Strings, Categorical, List, supported_scalars) – The key or container of keys that might be used to index into the Series.

Return type:

The validated key(s), with lists and tuples converted to pdarrays

Raises:
  • TypeError – Raised if keys are not boolean values or the type of the labels Raised if key is not one of the supported types

  • KeyError – Raised if container of keys has keys not present in the Series

  • IndexError – Raised if the length of a boolean key array is different from the Series

validate_val(val: pdarray | Strings | supported_scalars | List) pdarray | Strings | supported_scalars[source]

Validates type requirements for values being written into the Series. Also converts list and tuple arguments into pdarrays.

Parameters:

val (pdarray, Strings, list, supported_scalars) – The value or container of values that might be assigned into the Series.

Return type:

The validated value, with lists converted to pdarrays

Raises:

TypeError

Raised if val is not the same type or a container with elements

of the same time as the Series

Raised if val is a string or Strings type. Raised if val is not one of the supported types

value_counts(sort: bool = True) Series[source]

Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element.

sort : Boolean. Whether or not to sort the results. Default is true.

var()
class arkouda.SeriesDTypes

dict() -> new empty dictionary dict(mapping) -> new dictionary initialized from a mapping object’s

(key, value) pairs

dict(iterable) -> new dictionary initialized as if via:

d = {} for k, v in iterable:

d[k] = v

dict(**kwargs) -> new dictionary initialized with the name=value pairs

in the keyword argument list. For example: dict(one=1, two=2)

clear(*args, **kwargs)

D.clear() -> None. Remove all items from D.

copy(*args, **kwargs)

D.copy() -> a shallow copy of D

fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items(*args, **kwargs)

D.items() -> a set-like object providing a view on D’s items

keys(*args, **kwargs)

D.keys() -> a set-like object providing a view on D’s keys

pop(*args, **kwargs)

D.pop(k[,d]) -> v, remove specified key and return the corresponding value.

If key is not found, default is returned if given, otherwise KeyError is raised

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update(*args, **kwargs)

D.update([E, ]**F) -> None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values(*args, **kwargs)

D.values() -> an object providing a view on D’s values

class arkouda.ShortDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

arkouda.SortingAlgorithm
class arkouda.StrDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.StringAccessor(series)[source]

Bases: Properties

series
class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.numpy.dtypes.int_scalars)[source]

Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.

entry

Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of

  • offsets array: starting indices for each string

  • bytes array: raw bytes of all strings joined by nulls

Type:

pdarray

size

The number of strings in the array

Type:

int_scalars

nbytes

The total number of bytes in all strings

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

The sizes of each dimension of the array

Type:

tuple

dtype

The dtype is ak.str

Type:

dtype

logger

Used for all logging operations

Type:

ArkoudaLogger

Notes

Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.

BinOps
astype(dtype) arkouda.pdarrayclass.pdarray[source]

Cast values of Strings object to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) Strings[source]

class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which the Strings object was registered under

Returns:

the Strings object registered with user_defined_name in the arkouda server

Return type:

Strings object

Raises:

TypeError – Raised if user_defined_name is not a str

See also

register, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

cached_regex_patterns() List[source]

Returns the regex patterns for which Match objects have been cached

capitalize() Strings[source]

Returns a new Strings from the original replaced with the first letter capitilzed and the remaining letters lowercase.

Returns:

Strings from the original replaced with the capitalized equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper, String.title

Examples

>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3',
... 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3',
... 'Strings are here 4'])
contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (str_scalars) – The substring in the form of string or byte array to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
decode(fromEncoding, toEncoding='UTF-8')[source]

Return a new strings object in fromEncoding, expecting that the current Strings is encoded in toEncoding

Parameters:
  • fromEncoding (str) – The current encoding of the strings object

  • toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

dtype
encode(toEncoding: str, fromEncoding: str = 'UTF-8')[source]

Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding

Parameters:
  • toEncoding (str) – The encoding that the strings will be converted to

  • fromEncoding (str) – The current encoding of the strings object, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The suffix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex = True)
array([True, True, True, True, True])
entry: arkouda.pdarrayclass.pdarray
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Strings are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Strings are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.array(["a", "b", "c"])
>>> s_cpy = ak.array(["a", "b", "c"])
>>> s.equals(s_cpy)
True
>>> s2 = ak.array(["a", "x", "c"])
>>> s.equals(s2)
False
find_locations(pattern: bytes | arkouda.numpy.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Finds pattern matches and returns pdarrays containing the number, start postitions, and lengths of matches

Parameters:

pattern (str_scalars) – The regex pattern used to find matches

Returns:

  • pdarray, int64 – For each original string, the number of pattern matches

  • pdarray, int64 – The start positons of pattern matches

  • pdarray, int64 – The lengths of pattern matches

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
findall(pattern: bytes | arkouda.numpy.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple[source]

Return a new Strings containg all non-overlapping matches of pattern

Parameters:
  • pattern (str_scalars) – Regex used to find matches

  • return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from

Returns:

  • Strings – Strings object containing only pattern matches

  • pdarray, int64 (optional) – The index of the original string each pattern match is from

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

Note

As multidimensional Strings are currently supported, flatten on a Strings object will always return itself.

static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.

Parameters:
  • offset_attrib (Union[pdarray, str]) – the array containing the offsets

  • bytes_attrib (Union[pdarray, str]) – the array containing the string values

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table and we need to instruct the server to assemble the into a composite entity.

static from_return_msg(rep_msg: str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response message

Parameters:

rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.

fullmatch(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the whole string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the whole string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=False; matched=False>
get_bytes()[source]

Getter for the bytes component (uint8 pdarray) of this Strings.

Returns:

Pdarray of bytes of the string accessed

Return type:

pdarray, uint8

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
get_lengths() arkouda.pdarrayclass.pdarray[source]

Return the length of each string in the array.

Returns:

The length of each string

Return type:

pdarray, int

Raises:

RuntimeError – Raised if there is a server-side error thrown

get_offsets()[source]

Getter for the offsets component (int64 pdarray) of this Strings.

Returns:

Pdarray of offsets of the string accessed

Return type:

pdarray, int64

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
get_prefixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long prefix of each string, where possible

Parameters:
  • n (int) – Length of prefix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix

  • proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.

Returns:

  • prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.

get_suffixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long suffix of each string, where possible

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix

  • proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.

Returns:

  • suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.

group() arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.

Raises:

RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message

hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each string.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

isalnum() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphanumeric.

Returns:

True for elements that are alphanumeric, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alnum = ak.array([f'%Strings {i}' for i in range(3)])
>>> alnum = ak.array([f'Strings{i}' for i in range(3)])
>>> strings = ak.concatenate([not_alnum, alnum])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'Strings0', 'Strings1', 'Strings2'])
>>> strings.isalnum()
array([False False False True True True])
isalpha() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphabetic. This means there is at least one character, and all the characters are alphabetic.

Returns:

True for elements that are alphabetic, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alpha = ak.array([f'%Strings {i}' for i in range(3)])
>>> alpha = ak.array(['StringA','StringB','StringC'])
>>> strings = ak.concatenate([not_alpha, alpha])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'StringA','StringB','StringC'])
>>> strings.isalpha()
array([False False False True True True])
isdecimal() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all decimal characters.

Returns:

True for elements that are decimals, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isdigit

Examples

>>> not_decimal = ak.array([f'Strings {i}' for i in range(3)])
>>> decimal = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_decimal, decimal])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdecimal()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdecimal()
array([False True False False False])
isdigit() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all digit characters.

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_digit = ak.array([f'Strings {i}' for i in range(3)])
>>> digit = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_digit, digit])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdigit()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdigit()
array([False True True True False])
isempty() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is empty.

True for elements that are the empty string, False otherwise

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_empty = ak.array([f'Strings {i}' for i in range(3)])
>>> empty = ak.array(['' for i in range(3)])
>>> strings = ak.concatenate([not_empty, empty])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', '', '', ''])
>>> strings.isempty()
islower() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase

Returns:

True for elements that are entirely lowercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isupper

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.islower()
array([True True True False False False])
isspace() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i has all whitespace characters (‘ ‘, ‘ ’, ‘

’, ‘ ’, ‘ ’, ‘ ’).

pdarray, bool

True for elements that are whitespace, False otherwise

RuntimeError

Raised if there is a server-side error thrown

Strings.islower Strings.isupper Strings.istitle

>>> not_space = ak.array([f'Strings {i}' for i in range(3)])
>>> space = ak.array([' ', '    ', '

‘, ‘ ‘, ‘ ‘, ‘ ‘, ‘

‘])
>>> strings = ak.concatenate([not_space, space])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', ' ',
... 'u0009', 'n', 'u000B', 'u000C', 'u000D', ' u0009nu000Bu000Cu000D'])
>>> strings.isspace()
array([False False False True True True True True True True])
istitle() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase

Returns:

True for elements that are titlecase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.istitle()
array([False False False True True True])
isupper() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase

Returns:

True for elements that are entirely uppercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.islower

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.isupper()
array([False False False True True True])
logger
lower() Strings[source]

Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent

Returns:

Strings with all uppercase characters from the original replaced with their lowercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
lstick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '') Strings[source]

Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (Union[bytes,str_scalars]) – String inserted between self and other

Returns:

The array of joined strings, as other + self

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance

  • RuntimeError – Raised if there is a server-side error thrown

See also

stick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
match(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the beginning of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the beginning of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=True, span=(0, 2); matched=False>
objType = 'Strings'
peel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple[source]

Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters

  • includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.

  • fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The field(s) peeled from the end of each string (unless fromRight is true)

right: Strings

The remainder of each string after peeling (unless fromRight is true)

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not byte or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

rpeel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

purge_cached_regex_patterns() None[source]

purges cached regex patterns

regex_split(pattern: bytes | arkouda.numpy.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple[source]

Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur

Parameters:
  • pattern (str) – Regex used to split strings into substrings

  • maxsplit (int) – The max number of pattern match occurences in each element to split. The default maxsplit=0 splits on all occurences

  • return_segments (bool) – If True, return mapping of original strings to first substring in return array.

Returns:

  • Strings – Substrings with pattern matches removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
register(user_defined_name: str) Strings[source]

Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach() This is an in-place operation, registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.

Parameters:

user_defined_name (str) – user defined name which the Strings object is to be registered under

Returns:

The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.

Return type:

Strings

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.

See also

attach, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

registered_name: str | None = None
rpeel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)[source]

Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters

  • includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The remainder of the string after peeling

right: Strings

The field(s) that were peeled from the right of each string

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

peel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
# Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: The name of the Strings dataset to be written, defaults to strings_array :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, create a new Strings dataset within existing files.

Parameters:
  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read. This is not supported for Parquet files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.

search(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match if any part of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4);
matched=False; matched=True, span=(0, 2); matched=False>
split(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple[source]

Unpack delimiter-joined substrings into a flat array.

Parameters:
  • delimiter (str) – Characters used to split strings into substrings

  • return_segments (bool) – If True, also return mapping of original strings to first substring in return array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

  • Strings – Flattened substrings with delimiters removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

See also

peel, rpeel

Examples

>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The prefix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not a bytes ior str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex = True)
array([True, True, True, True, True])
stick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '', toLeft: bool = False) Strings[source]

Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (str) – String inserted between self and other

  • toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.

Returns:

The array of joined strings

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance

  • ValueError – Raised if times is < 1

  • RuntimeError – Raised if there is a server-side error thrown

See also

lstick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
strip(chars: bytes | arkouda.numpy.dtypes.str_scalars | None = '') Strings[source]

Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.

Parameters:

chars – the set of characters to be removed

Returns:

Strings object with the leading and trailing characters matching the set of characters in the chars argument removed

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['Strings ', '  StringS  ', 'StringS   '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS  ', '  1StringS  12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
sub(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Strings[source]

Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

Strings with pattern matches replaced

Return type:

Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.subn

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
subn(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Tuple[source]

Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitions)

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

  • Strings – Strings with pattern matches replaced

  • pdarray, int64 – The number of substitutions made for each element of Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.sub

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
title() Strings[source]

Returns a new Strings from the original replaced with their titlecase equivalent.

Returns:

Strings from the original replaced with their titlecase equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)[source]

Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

str reponse message

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str[source]

Save the Strings object to HDF5. The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – The name of the Strings dataset to be written, defaults to strings_array

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • Parquet files do not store the segments, only the values.

  • Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string

  • the hdf5 group is named via the dataset parameter.

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf

to_list() list[source]

Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list with the same strings as this SegString

Return type:

list

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray with the same strings as this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Strings object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

See also

register, attach

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

static unregister_strings_by_name(user_defined_name: str) None[source]

Unregister a Strings object in the arkouda server previously registered via register()

Parameters:

user_defined_name (str) – The registered name of the Strings object

update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)[source]

Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Strings object

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

upper() Strings[source]

Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent

Returns:

Strings with all lowercase characters from the original replaced with their uppercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.lower

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.numpy.dtypes.int_scalars)[source]

Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.

entry

Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of

  • offsets array: starting indices for each string

  • bytes array: raw bytes of all strings joined by nulls

Type:

pdarray

size

The number of strings in the array

Type:

int_scalars

nbytes

The total number of bytes in all strings

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

The sizes of each dimension of the array

Type:

tuple

dtype

The dtype is ak.str

Type:

dtype

logger

Used for all logging operations

Type:

ArkoudaLogger

Notes

Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.

BinOps
astype(dtype) arkouda.pdarrayclass.pdarray[source]

Cast values of Strings object to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) Strings[source]

class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which the Strings object was registered under

Returns:

the Strings object registered with user_defined_name in the arkouda server

Return type:

Strings object

Raises:

TypeError – Raised if user_defined_name is not a str

See also

register, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

cached_regex_patterns() List[source]

Returns the regex patterns for which Match objects have been cached

capitalize() Strings[source]

Returns a new Strings from the original replaced with the first letter capitilzed and the remaining letters lowercase.

Returns:

Strings from the original replaced with the capitalized equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper, String.title

Examples

>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3',
... 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3',
... 'Strings are here 4'])
contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (str_scalars) – The substring in the form of string or byte array to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
decode(fromEncoding, toEncoding='UTF-8')[source]

Return a new strings object in fromEncoding, expecting that the current Strings is encoded in toEncoding

Parameters:
  • fromEncoding (str) – The current encoding of the strings object

  • toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

dtype
encode(toEncoding: str, fromEncoding: str = 'UTF-8')[source]

Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding

Parameters:
  • toEncoding (str) – The encoding that the strings will be converted to

  • fromEncoding (str) – The current encoding of the strings object, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The suffix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex = True)
array([True, True, True, True, True])
entry: arkouda.pdarrayclass.pdarray
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Strings are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Strings are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.array(["a", "b", "c"])
>>> s_cpy = ak.array(["a", "b", "c"])
>>> s.equals(s_cpy)
True
>>> s2 = ak.array(["a", "x", "c"])
>>> s.equals(s2)
False
find_locations(pattern: bytes | arkouda.numpy.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Finds pattern matches and returns pdarrays containing the number, start postitions, and lengths of matches

Parameters:

pattern (str_scalars) – The regex pattern used to find matches

Returns:

  • pdarray, int64 – For each original string, the number of pattern matches

  • pdarray, int64 – The start positons of pattern matches

  • pdarray, int64 – The lengths of pattern matches

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
findall(pattern: bytes | arkouda.numpy.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple[source]

Return a new Strings containg all non-overlapping matches of pattern

Parameters:
  • pattern (str_scalars) – Regex used to find matches

  • return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from

Returns:

  • Strings – Strings object containing only pattern matches

  • pdarray, int64 (optional) – The index of the original string each pattern match is from

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

Note

As multidimensional Strings are currently supported, flatten on a Strings object will always return itself.

static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.

Parameters:
  • offset_attrib (Union[pdarray, str]) – the array containing the offsets

  • bytes_attrib (Union[pdarray, str]) – the array containing the string values

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table and we need to instruct the server to assemble the into a composite entity.

static from_return_msg(rep_msg: str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response message

Parameters:

rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.

fullmatch(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the whole string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the whole string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=False; matched=False>
get_bytes()[source]

Getter for the bytes component (uint8 pdarray) of this Strings.

Returns:

Pdarray of bytes of the string accessed

Return type:

pdarray, uint8

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
get_lengths() arkouda.pdarrayclass.pdarray[source]

Return the length of each string in the array.

Returns:

The length of each string

Return type:

pdarray, int

Raises:

RuntimeError – Raised if there is a server-side error thrown

get_offsets()[source]

Getter for the offsets component (int64 pdarray) of this Strings.

Returns:

Pdarray of offsets of the string accessed

Return type:

pdarray, int64

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
get_prefixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long prefix of each string, where possible

Parameters:
  • n (int) – Length of prefix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix

  • proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.

Returns:

  • prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.

get_suffixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long suffix of each string, where possible

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix

  • proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.

Returns:

  • suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.

group() arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.

Raises:

RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message

hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each string.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

isalnum() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphanumeric.

Returns:

True for elements that are alphanumeric, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alnum = ak.array([f'%Strings {i}' for i in range(3)])
>>> alnum = ak.array([f'Strings{i}' for i in range(3)])
>>> strings = ak.concatenate([not_alnum, alnum])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'Strings0', 'Strings1', 'Strings2'])
>>> strings.isalnum()
array([False False False True True True])
isalpha() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphabetic. This means there is at least one character, and all the characters are alphabetic.

Returns:

True for elements that are alphabetic, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alpha = ak.array([f'%Strings {i}' for i in range(3)])
>>> alpha = ak.array(['StringA','StringB','StringC'])
>>> strings = ak.concatenate([not_alpha, alpha])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'StringA','StringB','StringC'])
>>> strings.isalpha()
array([False False False True True True])
isdecimal() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all decimal characters.

Returns:

True for elements that are decimals, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isdigit

Examples

>>> not_decimal = ak.array([f'Strings {i}' for i in range(3)])
>>> decimal = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_decimal, decimal])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdecimal()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdecimal()
array([False True False False False])
isdigit() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all digit characters.

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_digit = ak.array([f'Strings {i}' for i in range(3)])
>>> digit = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_digit, digit])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdigit()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdigit()
array([False True True True False])
isempty() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is empty.

True for elements that are the empty string, False otherwise

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_empty = ak.array([f'Strings {i}' for i in range(3)])
>>> empty = ak.array(['' for i in range(3)])
>>> strings = ak.concatenate([not_empty, empty])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', '', '', ''])
>>> strings.isempty()
islower() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase

Returns:

True for elements that are entirely lowercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isupper

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.islower()
array([True True True False False False])
isspace() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i has all whitespace characters (‘ ‘, ‘ ’, ‘

’, ‘ ’, ‘ ’, ‘ ’).

pdarray, bool

True for elements that are whitespace, False otherwise

RuntimeError

Raised if there is a server-side error thrown

Strings.islower Strings.isupper Strings.istitle

>>> not_space = ak.array([f'Strings {i}' for i in range(3)])
>>> space = ak.array([' ', '    ', '

‘, ‘ ‘, ‘ ‘, ‘ ‘, ‘

‘])
>>> strings = ak.concatenate([not_space, space])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', ' ',
... 'u0009', 'n', 'u000B', 'u000C', 'u000D', ' u0009nu000Bu000Cu000D'])
>>> strings.isspace()
array([False False False True True True True True True True])
istitle() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase

Returns:

True for elements that are titlecase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.istitle()
array([False False False True True True])
isupper() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase

Returns:

True for elements that are entirely uppercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.islower

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.isupper()
array([False False False True True True])
logger
lower() Strings[source]

Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent

Returns:

Strings with all uppercase characters from the original replaced with their lowercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
lstick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '') Strings[source]

Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (Union[bytes,str_scalars]) – String inserted between self and other

Returns:

The array of joined strings, as other + self

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance

  • RuntimeError – Raised if there is a server-side error thrown

See also

stick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
match(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the beginning of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the beginning of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=True, span=(0, 2); matched=False>
objType = 'Strings'
peel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple[source]

Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters

  • includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.

  • fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The field(s) peeled from the end of each string (unless fromRight is true)

right: Strings

The remainder of each string after peeling (unless fromRight is true)

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not byte or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

rpeel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

purge_cached_regex_patterns() None[source]

purges cached regex patterns

regex_split(pattern: bytes | arkouda.numpy.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple[source]

Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur

Parameters:
  • pattern (str) – Regex used to split strings into substrings

  • maxsplit (int) – The max number of pattern match occurences in each element to split. The default maxsplit=0 splits on all occurences

  • return_segments (bool) – If True, return mapping of original strings to first substring in return array.

Returns:

  • Strings – Substrings with pattern matches removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
register(user_defined_name: str) Strings[source]

Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach() This is an in-place operation, registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.

Parameters:

user_defined_name (str) – user defined name which the Strings object is to be registered under

Returns:

The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.

Return type:

Strings

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.

See also

attach, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

registered_name: str | None = None
rpeel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)[source]

Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters

  • includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The remainder of the string after peeling

right: Strings

The field(s) that were peeled from the right of each string

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

peel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
# Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: The name of the Strings dataset to be written, defaults to strings_array :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, create a new Strings dataset within existing files.

Parameters:
  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read. This is not supported for Parquet files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.

search(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match if any part of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4);
matched=False; matched=True, span=(0, 2); matched=False>
split(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple[source]

Unpack delimiter-joined substrings into a flat array.

Parameters:
  • delimiter (str) – Characters used to split strings into substrings

  • return_segments (bool) – If True, also return mapping of original strings to first substring in return array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

  • Strings – Flattened substrings with delimiters removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

See also

peel, rpeel

Examples

>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The prefix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not a bytes ior str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex = True)
array([True, True, True, True, True])
stick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '', toLeft: bool = False) Strings[source]

Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (str) – String inserted between self and other

  • toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.

Returns:

The array of joined strings

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance

  • ValueError – Raised if times is < 1

  • RuntimeError – Raised if there is a server-side error thrown

See also

lstick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
strip(chars: bytes | arkouda.numpy.dtypes.str_scalars | None = '') Strings[source]

Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.

Parameters:

chars – the set of characters to be removed

Returns:

Strings object with the leading and trailing characters matching the set of characters in the chars argument removed

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['Strings ', '  StringS  ', 'StringS   '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS  ', '  1StringS  12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
sub(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Strings[source]

Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

Strings with pattern matches replaced

Return type:

Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.subn

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
subn(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Tuple[source]

Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitions)

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

  • Strings – Strings with pattern matches replaced

  • pdarray, int64 – The number of substitutions made for each element of Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.sub

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
title() Strings[source]

Returns a new Strings from the original replaced with their titlecase equivalent.

Returns:

Strings from the original replaced with their titlecase equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)[source]

Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

str reponse message

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str[source]

Save the Strings object to HDF5. The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – The name of the Strings dataset to be written, defaults to strings_array

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • Parquet files do not store the segments, only the values.

  • Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string

  • the hdf5 group is named via the dataset parameter.

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf

to_list() list[source]

Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list with the same strings as this SegString

Return type:

list

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray with the same strings as this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Strings object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

See also

register, attach

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

static unregister_strings_by_name(user_defined_name: str) None[source]

Unregister a Strings object in the arkouda server previously registered via register()

Parameters:

user_defined_name (str) – The registered name of the Strings object

update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)[source]

Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Strings object

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

upper() Strings[source]

Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent

Returns:

Strings with all lowercase characters from the original replaced with their uppercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.lower

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.numpy.dtypes.int_scalars)[source]

Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.

entry

Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of

  • offsets array: starting indices for each string

  • bytes array: raw bytes of all strings joined by nulls

Type:

pdarray

size

The number of strings in the array

Type:

int_scalars

nbytes

The total number of bytes in all strings

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

The sizes of each dimension of the array

Type:

tuple

dtype

The dtype is ak.str

Type:

dtype

logger

Used for all logging operations

Type:

ArkoudaLogger

Notes

Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.

BinOps
astype(dtype) arkouda.pdarrayclass.pdarray[source]

Cast values of Strings object to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) Strings[source]

class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which the Strings object was registered under

Returns:

the Strings object registered with user_defined_name in the arkouda server

Return type:

Strings object

Raises:

TypeError – Raised if user_defined_name is not a str

See also

register, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

cached_regex_patterns() List[source]

Returns the regex patterns for which Match objects have been cached

capitalize() Strings[source]

Returns a new Strings from the original replaced with the first letter capitilzed and the remaining letters lowercase.

Returns:

Strings from the original replaced with the capitalized equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper, String.title

Examples

>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3',
... 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3',
... 'Strings are here 4'])
contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (str_scalars) – The substring in the form of string or byte array to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
decode(fromEncoding, toEncoding='UTF-8')[source]

Return a new strings object in fromEncoding, expecting that the current Strings is encoded in toEncoding

Parameters:
  • fromEncoding (str) – The current encoding of the strings object

  • toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

dtype
encode(toEncoding: str, fromEncoding: str = 'UTF-8')[source]

Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding

Parameters:
  • toEncoding (str) – The encoding that the strings will be converted to

  • fromEncoding (str) – The current encoding of the strings object, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The suffix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex = True)
array([True, True, True, True, True])
entry: arkouda.pdarrayclass.pdarray
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Strings are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Strings are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.array(["a", "b", "c"])
>>> s_cpy = ak.array(["a", "b", "c"])
>>> s.equals(s_cpy)
True
>>> s2 = ak.array(["a", "x", "c"])
>>> s.equals(s2)
False
find_locations(pattern: bytes | arkouda.numpy.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Finds pattern matches and returns pdarrays containing the number, start postitions, and lengths of matches

Parameters:

pattern (str_scalars) – The regex pattern used to find matches

Returns:

  • pdarray, int64 – For each original string, the number of pattern matches

  • pdarray, int64 – The start positons of pattern matches

  • pdarray, int64 – The lengths of pattern matches

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
findall(pattern: bytes | arkouda.numpy.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple[source]

Return a new Strings containg all non-overlapping matches of pattern

Parameters:
  • pattern (str_scalars) – Regex used to find matches

  • return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from

Returns:

  • Strings – Strings object containing only pattern matches

  • pdarray, int64 (optional) – The index of the original string each pattern match is from

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

Note

As multidimensional Strings are currently supported, flatten on a Strings object will always return itself.

static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.

Parameters:
  • offset_attrib (Union[pdarray, str]) – the array containing the offsets

  • bytes_attrib (Union[pdarray, str]) – the array containing the string values

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table and we need to instruct the server to assemble the into a composite entity.

static from_return_msg(rep_msg: str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response message

Parameters:

rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.

fullmatch(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the whole string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the whole string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=False; matched=False>
get_bytes()[source]

Getter for the bytes component (uint8 pdarray) of this Strings.

Returns:

Pdarray of bytes of the string accessed

Return type:

pdarray, uint8

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
get_lengths() arkouda.pdarrayclass.pdarray[source]

Return the length of each string in the array.

Returns:

The length of each string

Return type:

pdarray, int

Raises:

RuntimeError – Raised if there is a server-side error thrown

get_offsets()[source]

Getter for the offsets component (int64 pdarray) of this Strings.

Returns:

Pdarray of offsets of the string accessed

Return type:

pdarray, int64

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
get_prefixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long prefix of each string, where possible

Parameters:
  • n (int) – Length of prefix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix

  • proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.

Returns:

  • prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.

get_suffixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long suffix of each string, where possible

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix

  • proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.

Returns:

  • suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.

group() arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.

Raises:

RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message

hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each string.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

isalnum() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphanumeric.

Returns:

True for elements that are alphanumeric, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alnum = ak.array([f'%Strings {i}' for i in range(3)])
>>> alnum = ak.array([f'Strings{i}' for i in range(3)])
>>> strings = ak.concatenate([not_alnum, alnum])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'Strings0', 'Strings1', 'Strings2'])
>>> strings.isalnum()
array([False False False True True True])
isalpha() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphabetic. This means there is at least one character, and all the characters are alphabetic.

Returns:

True for elements that are alphabetic, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alpha = ak.array([f'%Strings {i}' for i in range(3)])
>>> alpha = ak.array(['StringA','StringB','StringC'])
>>> strings = ak.concatenate([not_alpha, alpha])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'StringA','StringB','StringC'])
>>> strings.isalpha()
array([False False False True True True])
isdecimal() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all decimal characters.

Returns:

True for elements that are decimals, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isdigit

Examples

>>> not_decimal = ak.array([f'Strings {i}' for i in range(3)])
>>> decimal = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_decimal, decimal])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdecimal()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdecimal()
array([False True False False False])
isdigit() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all digit characters.

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_digit = ak.array([f'Strings {i}' for i in range(3)])
>>> digit = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_digit, digit])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdigit()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdigit()
array([False True True True False])
isempty() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is empty.

True for elements that are the empty string, False otherwise

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_empty = ak.array([f'Strings {i}' for i in range(3)])
>>> empty = ak.array(['' for i in range(3)])
>>> strings = ak.concatenate([not_empty, empty])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', '', '', ''])
>>> strings.isempty()
islower() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase

Returns:

True for elements that are entirely lowercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isupper

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.islower()
array([True True True False False False])
isspace() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i has all whitespace characters (‘ ‘, ‘ ’, ‘

’, ‘ ’, ‘ ’, ‘ ’).

pdarray, bool

True for elements that are whitespace, False otherwise

RuntimeError

Raised if there is a server-side error thrown

Strings.islower Strings.isupper Strings.istitle

>>> not_space = ak.array([f'Strings {i}' for i in range(3)])
>>> space = ak.array([' ', '    ', '

‘, ‘ ‘, ‘ ‘, ‘ ‘, ‘

‘])
>>> strings = ak.concatenate([not_space, space])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', ' ',
... 'u0009', 'n', 'u000B', 'u000C', 'u000D', ' u0009nu000Bu000Cu000D'])
>>> strings.isspace()
array([False False False True True True True True True True])
istitle() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase

Returns:

True for elements that are titlecase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.istitle()
array([False False False True True True])
isupper() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase

Returns:

True for elements that are entirely uppercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.islower

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.isupper()
array([False False False True True True])
logger
lower() Strings[source]

Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent

Returns:

Strings with all uppercase characters from the original replaced with their lowercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
lstick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '') Strings[source]

Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (Union[bytes,str_scalars]) – String inserted between self and other

Returns:

The array of joined strings, as other + self

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance

  • RuntimeError – Raised if there is a server-side error thrown

See also

stick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
match(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the beginning of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the beginning of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=True, span=(0, 2); matched=False>
objType = 'Strings'
peel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple[source]

Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters

  • includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.

  • fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The field(s) peeled from the end of each string (unless fromRight is true)

right: Strings

The remainder of each string after peeling (unless fromRight is true)

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not byte or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

rpeel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

purge_cached_regex_patterns() None[source]

purges cached regex patterns

regex_split(pattern: bytes | arkouda.numpy.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple[source]

Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur

Parameters:
  • pattern (str) – Regex used to split strings into substrings

  • maxsplit (int) – The max number of pattern match occurences in each element to split. The default maxsplit=0 splits on all occurences

  • return_segments (bool) – If True, return mapping of original strings to first substring in return array.

Returns:

  • Strings – Substrings with pattern matches removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
register(user_defined_name: str) Strings[source]

Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach() This is an in-place operation, registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.

Parameters:

user_defined_name (str) – user defined name which the Strings object is to be registered under

Returns:

The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.

Return type:

Strings

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.

See also

attach, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

registered_name: str | None = None
rpeel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)[source]

Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters

  • includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The remainder of the string after peeling

right: Strings

The field(s) that were peeled from the right of each string

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

peel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
# Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: The name of the Strings dataset to be written, defaults to strings_array :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, create a new Strings dataset within existing files.

Parameters:
  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read. This is not supported for Parquet files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.

search(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match if any part of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4);
matched=False; matched=True, span=(0, 2); matched=False>
split(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple[source]

Unpack delimiter-joined substrings into a flat array.

Parameters:
  • delimiter (str) – Characters used to split strings into substrings

  • return_segments (bool) – If True, also return mapping of original strings to first substring in return array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

  • Strings – Flattened substrings with delimiters removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

See also

peel, rpeel

Examples

>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The prefix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not a bytes ior str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex = True)
array([True, True, True, True, True])
stick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '', toLeft: bool = False) Strings[source]

Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (str) – String inserted between self and other

  • toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.

Returns:

The array of joined strings

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance

  • ValueError – Raised if times is < 1

  • RuntimeError – Raised if there is a server-side error thrown

See also

lstick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
strip(chars: bytes | arkouda.numpy.dtypes.str_scalars | None = '') Strings[source]

Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.

Parameters:

chars – the set of characters to be removed

Returns:

Strings object with the leading and trailing characters matching the set of characters in the chars argument removed

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['Strings ', '  StringS  ', 'StringS   '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS  ', '  1StringS  12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
sub(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Strings[source]

Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

Strings with pattern matches replaced

Return type:

Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.subn

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
subn(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Tuple[source]

Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitions)

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

  • Strings – Strings with pattern matches replaced

  • pdarray, int64 – The number of substitutions made for each element of Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.sub

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
title() Strings[source]

Returns a new Strings from the original replaced with their titlecase equivalent.

Returns:

Strings from the original replaced with their titlecase equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)[source]

Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

str reponse message

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str[source]

Save the Strings object to HDF5. The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – The name of the Strings dataset to be written, defaults to strings_array

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • Parquet files do not store the segments, only the values.

  • Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string

  • the hdf5 group is named via the dataset parameter.

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf

to_list() list[source]

Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list with the same strings as this SegString

Return type:

list

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray with the same strings as this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Strings object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

See also

register, attach

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

static unregister_strings_by_name(user_defined_name: str) None[source]

Unregister a Strings object in the arkouda server previously registered via register()

Parameters:

user_defined_name (str) – The registered name of the Strings object

update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)[source]

Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Strings object

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

upper() Strings[source]

Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent

Returns:

Strings with all lowercase characters from the original replaced with their uppercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.lower

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.numpy.dtypes.int_scalars)[source]

Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.

entry

Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of

  • offsets array: starting indices for each string

  • bytes array: raw bytes of all strings joined by nulls

Type:

pdarray

size

The number of strings in the array

Type:

int_scalars

nbytes

The total number of bytes in all strings

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

The sizes of each dimension of the array

Type:

tuple

dtype

The dtype is ak.str

Type:

dtype

logger

Used for all logging operations

Type:

ArkoudaLogger

Notes

Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.

BinOps
astype(dtype) arkouda.pdarrayclass.pdarray[source]

Cast values of Strings object to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) Strings[source]

class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which the Strings object was registered under

Returns:

the Strings object registered with user_defined_name in the arkouda server

Return type:

Strings object

Raises:

TypeError – Raised if user_defined_name is not a str

See also

register, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

cached_regex_patterns() List[source]

Returns the regex patterns for which Match objects have been cached

capitalize() Strings[source]

Returns a new Strings from the original replaced with the first letter capitilzed and the remaining letters lowercase.

Returns:

Strings from the original replaced with the capitalized equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper, String.title

Examples

>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3',
... 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3',
... 'Strings are here 4'])
contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (str_scalars) – The substring in the form of string or byte array to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
decode(fromEncoding, toEncoding='UTF-8')[source]

Return a new strings object in fromEncoding, expecting that the current Strings is encoded in toEncoding

Parameters:
  • fromEncoding (str) – The current encoding of the strings object

  • toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

dtype
encode(toEncoding: str, fromEncoding: str = 'UTF-8')[source]

Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding

Parameters:
  • toEncoding (str) – The encoding that the strings will be converted to

  • fromEncoding (str) – The current encoding of the strings object, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The suffix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex = True)
array([True, True, True, True, True])
entry: arkouda.pdarrayclass.pdarray
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Strings are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Strings are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.array(["a", "b", "c"])
>>> s_cpy = ak.array(["a", "b", "c"])
>>> s.equals(s_cpy)
True
>>> s2 = ak.array(["a", "x", "c"])
>>> s.equals(s2)
False
find_locations(pattern: bytes | arkouda.numpy.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Finds pattern matches and returns pdarrays containing the number, start postitions, and lengths of matches

Parameters:

pattern (str_scalars) – The regex pattern used to find matches

Returns:

  • pdarray, int64 – For each original string, the number of pattern matches

  • pdarray, int64 – The start positons of pattern matches

  • pdarray, int64 – The lengths of pattern matches

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
findall(pattern: bytes | arkouda.numpy.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple[source]

Return a new Strings containg all non-overlapping matches of pattern

Parameters:
  • pattern (str_scalars) – Regex used to find matches

  • return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from

Returns:

  • Strings – Strings object containing only pattern matches

  • pdarray, int64 (optional) – The index of the original string each pattern match is from

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

Note

As multidimensional Strings are currently supported, flatten on a Strings object will always return itself.

static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.

Parameters:
  • offset_attrib (Union[pdarray, str]) – the array containing the offsets

  • bytes_attrib (Union[pdarray, str]) – the array containing the string values

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table and we need to instruct the server to assemble the into a composite entity.

static from_return_msg(rep_msg: str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response message

Parameters:

rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.

fullmatch(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the whole string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the whole string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=False; matched=False>
get_bytes()[source]

Getter for the bytes component (uint8 pdarray) of this Strings.

Returns:

Pdarray of bytes of the string accessed

Return type:

pdarray, uint8

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
get_lengths() arkouda.pdarrayclass.pdarray[source]

Return the length of each string in the array.

Returns:

The length of each string

Return type:

pdarray, int

Raises:

RuntimeError – Raised if there is a server-side error thrown

get_offsets()[source]

Getter for the offsets component (int64 pdarray) of this Strings.

Returns:

Pdarray of offsets of the string accessed

Return type:

pdarray, int64

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
get_prefixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long prefix of each string, where possible

Parameters:
  • n (int) – Length of prefix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix

  • proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.

Returns:

  • prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.

get_suffixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long suffix of each string, where possible

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix

  • proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.

Returns:

  • suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.

group() arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.

Raises:

RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message

hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each string.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

isalnum() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphanumeric.

Returns:

True for elements that are alphanumeric, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alnum = ak.array([f'%Strings {i}' for i in range(3)])
>>> alnum = ak.array([f'Strings{i}' for i in range(3)])
>>> strings = ak.concatenate([not_alnum, alnum])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'Strings0', 'Strings1', 'Strings2'])
>>> strings.isalnum()
array([False False False True True True])
isalpha() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphabetic. This means there is at least one character, and all the characters are alphabetic.

Returns:

True for elements that are alphabetic, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alpha = ak.array([f'%Strings {i}' for i in range(3)])
>>> alpha = ak.array(['StringA','StringB','StringC'])
>>> strings = ak.concatenate([not_alpha, alpha])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'StringA','StringB','StringC'])
>>> strings.isalpha()
array([False False False True True True])
isdecimal() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all decimal characters.

Returns:

True for elements that are decimals, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isdigit

Examples

>>> not_decimal = ak.array([f'Strings {i}' for i in range(3)])
>>> decimal = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_decimal, decimal])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdecimal()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdecimal()
array([False True False False False])
isdigit() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all digit characters.

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_digit = ak.array([f'Strings {i}' for i in range(3)])
>>> digit = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_digit, digit])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdigit()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdigit()
array([False True True True False])
isempty() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is empty.

True for elements that are the empty string, False otherwise

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_empty = ak.array([f'Strings {i}' for i in range(3)])
>>> empty = ak.array(['' for i in range(3)])
>>> strings = ak.concatenate([not_empty, empty])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', '', '', ''])
>>> strings.isempty()
islower() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase

Returns:

True for elements that are entirely lowercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isupper

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.islower()
array([True True True False False False])
isspace() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i has all whitespace characters (‘ ‘, ‘ ’, ‘

’, ‘ ’, ‘ ’, ‘ ’).

pdarray, bool

True for elements that are whitespace, False otherwise

RuntimeError

Raised if there is a server-side error thrown

Strings.islower Strings.isupper Strings.istitle

>>> not_space = ak.array([f'Strings {i}' for i in range(3)])
>>> space = ak.array([' ', '    ', '

‘, ‘ ‘, ‘ ‘, ‘ ‘, ‘

‘])
>>> strings = ak.concatenate([not_space, space])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', ' ',
... 'u0009', 'n', 'u000B', 'u000C', 'u000D', ' u0009nu000Bu000Cu000D'])
>>> strings.isspace()
array([False False False True True True True True True True])
istitle() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase

Returns:

True for elements that are titlecase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.istitle()
array([False False False True True True])
isupper() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase

Returns:

True for elements that are entirely uppercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.islower

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.isupper()
array([False False False True True True])
logger
lower() Strings[source]

Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent

Returns:

Strings with all uppercase characters from the original replaced with their lowercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
lstick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '') Strings[source]

Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (Union[bytes,str_scalars]) – String inserted between self and other

Returns:

The array of joined strings, as other + self

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance

  • RuntimeError – Raised if there is a server-side error thrown

See also

stick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
match(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the beginning of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the beginning of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=True, span=(0, 2); matched=False>
objType = 'Strings'
peel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple[source]

Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters

  • includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.

  • fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The field(s) peeled from the end of each string (unless fromRight is true)

right: Strings

The remainder of each string after peeling (unless fromRight is true)

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not byte or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

rpeel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

purge_cached_regex_patterns() None[source]

purges cached regex patterns

regex_split(pattern: bytes | arkouda.numpy.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple[source]

Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur

Parameters:
  • pattern (str) – Regex used to split strings into substrings

  • maxsplit (int) – The max number of pattern match occurences in each element to split. The default maxsplit=0 splits on all occurences

  • return_segments (bool) – If True, return mapping of original strings to first substring in return array.

Returns:

  • Strings – Substrings with pattern matches removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
register(user_defined_name: str) Strings[source]

Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach() This is an in-place operation, registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.

Parameters:

user_defined_name (str) – user defined name which the Strings object is to be registered under

Returns:

The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.

Return type:

Strings

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.

See also

attach, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

registered_name: str | None = None
rpeel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)[source]

Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters

  • includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The remainder of the string after peeling

right: Strings

The field(s) that were peeled from the right of each string

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

peel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
# Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: The name of the Strings dataset to be written, defaults to strings_array :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, create a new Strings dataset within existing files.

Parameters:
  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read. This is not supported for Parquet files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.

search(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match if any part of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4);
matched=False; matched=True, span=(0, 2); matched=False>
split(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple[source]

Unpack delimiter-joined substrings into a flat array.

Parameters:
  • delimiter (str) – Characters used to split strings into substrings

  • return_segments (bool) – If True, also return mapping of original strings to first substring in return array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

  • Strings – Flattened substrings with delimiters removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

See also

peel, rpeel

Examples

>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The prefix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not a bytes ior str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex = True)
array([True, True, True, True, True])
stick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '', toLeft: bool = False) Strings[source]

Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (str) – String inserted between self and other

  • toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.

Returns:

The array of joined strings

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance

  • ValueError – Raised if times is < 1

  • RuntimeError – Raised if there is a server-side error thrown

See also

lstick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
strip(chars: bytes | arkouda.numpy.dtypes.str_scalars | None = '') Strings[source]

Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.

Parameters:

chars – the set of characters to be removed

Returns:

Strings object with the leading and trailing characters matching the set of characters in the chars argument removed

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['Strings ', '  StringS  ', 'StringS   '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS  ', '  1StringS  12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
sub(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Strings[source]

Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

Strings with pattern matches replaced

Return type:

Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.subn

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
subn(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Tuple[source]

Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitions)

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

  • Strings – Strings with pattern matches replaced

  • pdarray, int64 – The number of substitutions made for each element of Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.sub

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
title() Strings[source]

Returns a new Strings from the original replaced with their titlecase equivalent.

Returns:

Strings from the original replaced with their titlecase equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)[source]

Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

str reponse message

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str[source]

Save the Strings object to HDF5. The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – The name of the Strings dataset to be written, defaults to strings_array

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • Parquet files do not store the segments, only the values.

  • Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string

  • the hdf5 group is named via the dataset parameter.

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf

to_list() list[source]

Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list with the same strings as this SegString

Return type:

list

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray with the same strings as this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Strings object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

See also

register, attach

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

static unregister_strings_by_name(user_defined_name: str) None[source]

Unregister a Strings object in the arkouda server previously registered via register()

Parameters:

user_defined_name (str) – The registered name of the Strings object

update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)[source]

Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Strings object

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

upper() Strings[source]

Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent

Returns:

Strings with all lowercase characters from the original replaced with their uppercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.lower

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
class arkouda.Strings(strings_pdarray: arkouda.pdarrayclass.pdarray, bytes_size: arkouda.numpy.dtypes.int_scalars)[source]

Represents an array of strings whose data resides on the arkouda server. The user should not call this class directly; rather its instances are created by other arkouda functions.

entry

Encapsulation of a Segmented Strings array contained on the arkouda server. This is a composite of

  • offsets array: starting indices for each string

  • bytes array: raw bytes of all strings joined by nulls

Type:

pdarray

size

The number of strings in the array

Type:

int_scalars

nbytes

The total number of bytes in all strings

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

The sizes of each dimension of the array

Type:

tuple

dtype

The dtype is ak.str

Type:

dtype

logger

Used for all logging operations

Type:

ArkoudaLogger

Notes

Strings is composed of two pdarrays: (1) offsets, which contains the starting indices for each string and (2) bytes, which contains the raw bytes of all strings, delimited by nulls.

BinOps
astype(dtype) arkouda.pdarrayclass.pdarray[source]

Cast values of Strings object to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) Strings[source]

class method to return a Strings object attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which the Strings object was registered under

Returns:

the Strings object registered with user_defined_name in the arkouda server

Return type:

Strings object

Raises:

TypeError – Raised if user_defined_name is not a str

See also

register, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

cached_regex_patterns() List[source]

Returns the regex patterns for which Match objects have been cached

capitalize() Strings[source]

Returns a new Strings from the original replaced with the first letter capitilzed and the remaining letters lowercase.

Returns:

Strings from the original replaced with the capitalized equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper, String.title

Examples

>>> strings = ak.array([f'StrINgS aRe Here {i}' for i in range(5)])
>>> strings
array(['StrINgS aRe Here 0', 'StrINgS aRe Here 1', 'StrINgS aRe Here 2', 'StrINgS aRe Here 3',
... 'StrINgS aRe Here 4'])
>>> strings.title()
array(['Strings are here 0', 'Strings are here 1', 'Strings are here 2', 'Strings are here 3',
... 'Strings are here 4'])
contains(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element contains the given substring.

Parameters:
  • substr (str_scalars) – The substring in the form of string or byte array to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that contain substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> strings
array(['1 string 1', '2 string 2', '3 string 3', '4 string 4', '5 string 5'])
>>> strings.contains('string')
array([True, True, True, True, True])
>>> strings.contains('string \d', regex=True)
array([True, True, True, True, True])
decode(fromEncoding, toEncoding='UTF-8')[source]

Return a new strings object in fromEncoding, expecting that the current Strings is encoded in toEncoding

Parameters:
  • fromEncoding (str) – The current encoding of the strings object

  • toEncoding (str) – The encoding that the strings will be converted to, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

dtype
encode(toEncoding: str, fromEncoding: str = 'UTF-8')[source]

Return a new strings object in toEncoding, expecting that the current Strings is encoded in fromEncoding

Parameters:
  • toEncoding (str) – The encoding that the strings will be converted to

  • fromEncoding (str) – The current encoding of the strings object, default to UTF-8

Returns:

A new Strings object in toEncoding

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

endswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element ends with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The suffix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that end with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not bytes or str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.endswith('ing')
array([True, True, True, True, True])
>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.endswith('ing \d', regex = True)
array([True, True, True, True, True])
entry: arkouda.pdarrayclass.pdarray
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether Strings are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the Strings are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> s = ak.array(["a", "b", "c"])
>>> s_cpy = ak.array(["a", "b", "c"])
>>> s.equals(s_cpy)
True
>>> s2 = ak.array(["a", "x", "c"])
>>> s.equals(s2)
False
find_locations(pattern: bytes | arkouda.numpy.dtypes.str_scalars) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Finds pattern matches and returns pdarrays containing the number, start postitions, and lengths of matches

Parameters:

pattern (str_scalars) – The regex pattern used to find matches

Returns:

  • pdarray, int64 – For each original string, the number of pattern matches

  • pdarray, int64 – The start positons of pattern matches

  • pdarray, int64 – The lengths of pattern matches

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array([f'{i} string {i}' for i in range(1, 6)])
>>> num_matches, starts, lens = strings.find_locations('\d')
>>> num_matches
array([2, 2, 2, 2, 2])
>>> starts
array([0, 9, 0, 9, 0, 9, 0, 9, 0, 9])
>>> lens
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
findall(pattern: bytes | arkouda.numpy.dtypes.str_scalars, return_match_origins: bool = False) Strings | Tuple[source]

Return a new Strings containg all non-overlapping matches of pattern

Parameters:
  • pattern (str_scalars) – Regex used to find matches

  • return_match_origins (bool) – If True, return a pdarray containing the index of the original string each pattern match is from

Returns:

  • Strings – Strings object containing only pattern matches

  • pdarray, int64 (optional) – The index of the original string each pattern match is from

Raises:
  • TypeError – Raised if the pattern parameter is not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.findall('_+', return_match_origins=True)
(array(['_', '___', '____', '__', '___', '____', '___']), array([0 0 1 3 3 3 3]))
flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

Note

As multidimensional Strings are currently supported, flatten on a Strings object will always return itself.

static from_parts(offset_attrib: arkouda.pdarrayclass.pdarray | str, bytes_attrib: arkouda.pdarrayclass.pdarray | str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response where the arrays are separate components.

Parameters:
  • offset_attrib (Union[pdarray, str]) – the array containing the offsets

  • bytes_attrib (Union[pdarray, str]) – the array containing the string values

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

This factory method is used when we construct the parts of a Strings object on the client side and transfer the offsets & bytes separately to the server. This results in two entries in the symbol table and we need to instruct the server to assemble the into a composite entity.

static from_return_msg(rep_msg: str) Strings[source]

Factory method for creating a Strings object from an Arkouda server response message

Parameters:

rep_msg (str) – Server response message currently of form created name type size ndim shape itemsize+created bytes.size 1234

Returns:

object representing a segmented strings array on the server

Return type:

Strings

Raises:

RuntimeError – Raised if there’s an error converting a server-returned str-descriptor

Notes

We really don’t have an itemsize because these are variable length strings. In the future we could probably use this position to store the total bytes.

fullmatch(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the whole string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the whole string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.fullmatch('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=False; matched=False>
get_bytes()[source]

Getter for the bytes component (uint8 pdarray) of this Strings.

Returns:

Pdarray of bytes of the string accessed

Return type:

pdarray, uint8

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_bytes()
[111 110 101 0 116 119 111 0 116 104 114 101 101 0]
get_lengths() arkouda.pdarrayclass.pdarray[source]

Return the length of each string in the array.

Returns:

The length of each string

Return type:

pdarray, int

Raises:

RuntimeError – Raised if there is a server-side error thrown

get_offsets()[source]

Getter for the offsets component (int64 pdarray) of this Strings.

Returns:

Pdarray of offsets of the string accessed

Return type:

pdarray, int64

Example

>>> x = ak.array(['one', 'two', 'three'])
>>> x.get_offsets()
[0 4 8]
get_prefixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long prefix of each string, where possible

Parameters:
  • n (int) – Length of prefix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-prefix

  • proper (bool) – If True, only return proper prefixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a prefix.

Returns:

  • prefixes (Strings) – The array of n-character prefixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character prefix, False otherwise.

get_suffixes(n: arkouda.numpy.dtypes.int_scalars, return_origins: bool = True, proper: bool = True) Strings | Tuple[Strings, arkouda.pdarrayclass.pdarray][source]

Return the n-long suffix of each string, where possible

Parameters:
  • n (int) – Length of suffix

  • return_origins (bool) – If True, return a logical index indicating which strings were long enough to return an n-suffix

  • proper (bool) – If True, only return proper suffixes, i.e. from strings that are at least n+1 long. If False, allow the entire string to be returned as a suffix.

Returns:

  • suffixes (Strings) – The array of n-character suffixes; the number of elements is the number of True values in the returned mask.

  • origin_indices (pdarray, bool) – Boolean array that is True where the string was long enough to return an n-character suffix, False otherwise.

group() arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the array, placing equivalent strings together. All instances of the same string are guaranteed to lie in one contiguous block of the permuted array, but the blocks are not necessarily ordered.

Returns:

The permutation that groups the array by value

Return type:

pdarray

See also

GroupBy, unique

Notes

If the arkouda server is compiled with “-sSegmentedString.useHash=true”, then arkouda uses 128-bit hash values to group strings, rather than sorting the strings directly. This method is fast, but the resulting permutation merely groups equivalent strings and does not sort them. If the “useHash” parameter is false, then a full sort is performed.

Raises:

RuntimeError – Raised if there is a server-side error in executing group request or creating the pdarray encapsulating the return message

hash() Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a 128-bit hash of each string.

Returns:

A tuple of two int64 pdarrays. The ith hash value is the concatenation of the ith values from each array.

Return type:

Tuple[pdarray,pdarray]

Notes

The implementation uses SipHash128, a fast and balanced hash function (used by Python for dictionaries and sets). For realistic numbers of strings (up to about 10**15), the probability of a collision between two 128-bit hash values is negligible.

property inferred_type: str

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

isalnum() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphanumeric.

Returns:

True for elements that are alphanumeric, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alnum = ak.array([f'%Strings {i}' for i in range(3)])
>>> alnum = ak.array([f'Strings{i}' for i in range(3)])
>>> strings = ak.concatenate([not_alnum, alnum])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'Strings0', 'Strings1', 'Strings2'])
>>> strings.isalnum()
array([False False False True True True])
isalpha() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is alphabetic. This means there is at least one character, and all the characters are alphabetic.

Returns:

True for elements that are alphabetic, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_alpha = ak.array([f'%Strings {i}' for i in range(3)])
>>> alpha = ak.array(['StringA','StringB','StringC'])
>>> strings = ak.concatenate([not_alpha, alpha])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', 'StringA','StringB','StringC'])
>>> strings.isalpha()
array([False False False True True True])
isdecimal() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all decimal characters.

Returns:

True for elements that are decimals, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isdigit

Examples

>>> not_decimal = ak.array([f'Strings {i}' for i in range(3)])
>>> decimal = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_decimal, decimal])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdecimal()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdecimal()
array([False True False False False])
isdigit() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings has all digit characters.

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_digit = ak.array([f'Strings {i}' for i in range(3)])
>>> digit = ak.array([f'12{i}' for i in range(3)])
>>> strings = ak.concatenate([not_digit, digit])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', '120', '121', '122'])
>>> strings.isdigit()
array([False False False True True True])
Special Character Examples
>>> special_strings = ak.array(["3.14", "0", "²", "2³₇", "2³x₇"])
>>> special_strings
array(['3.14', '0', '²', '2³₇', '2³x₇'])
>>> special_strings.isdigit()
array([False True True True False])
isempty() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is empty.

True for elements that are the empty string, False otherwise

Returns:

True for elements that are digits, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> not_empty = ak.array([f'Strings {i}' for i in range(3)])
>>> empty = ak.array(['' for i in range(3)])
>>> strings = ak.concatenate([not_empty, empty])
>>> strings
array(['%Strings 0', '%Strings 1', '%Strings 2', '', '', ''])
>>> strings.isempty()
islower() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely lowercase

Returns:

True for elements that are entirely lowercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.isupper

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.islower()
array([True True True False False False])
isspace() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i has all whitespace characters (‘ ‘, ‘ ’, ‘

’, ‘ ’, ‘ ’, ‘ ’).

pdarray, bool

True for elements that are whitespace, False otherwise

RuntimeError

Raised if there is a server-side error thrown

Strings.islower Strings.isupper Strings.istitle

>>> not_space = ak.array([f'Strings {i}' for i in range(3)])
>>> space = ak.array([' ', '    ', '

‘, ‘ ‘, ‘ ‘, ‘ ‘, ‘

‘])
>>> strings = ak.concatenate([not_space, space])
>>> strings
array(['Strings 0', 'Strings 1', 'Strings 2', ' ',
... 'u0009', 'n', 'u000B', 'u000C', 'u000D', ' u0009nu000Bu000Cu000D'])
>>> strings.isspace()
array([False False False True True True True True True True])
istitle() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is titlecase

Returns:

True for elements that are titlecase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> mixed = ak.array([f'sTrINgs {i}' for i in range(3)])
>>> title = ak.array([f'Strings {i}' for i in range(3)])
>>> strings = ak.concatenate([mixed, title])
>>> strings
array(['sTrINgs 0', 'sTrINgs 1', 'sTrINgs 2', 'Strings 0', 'Strings 1', 'Strings 2'])
>>> strings.istitle()
array([False False False True True True])
isupper() arkouda.pdarrayclass.pdarray[source]

Returns a boolean pdarray where index i indicates whether string i of the Strings is entirely uppercase

Returns:

True for elements that are entirely uppercase, False otherwise

Return type:

pdarray, bool

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.islower

Examples

>>> lower = ak.array([f'strings {i}' for i in range(3)])
>>> upper = ak.array([f'STRINGS {i}' for i in range(3)])
>>> strings = ak.concatenate([lower, upper])
>>> strings
array(['strings 0', 'strings 1', 'strings 2', 'STRINGS 0', 'STRINGS 1', 'STRINGS 2'])
>>> strings.isupper()
array([False False False True True True])
logger
lower() Strings[source]

Returns a new Strings with all uppercase characters from the original replaced with their lowercase equivalent

Returns:

Strings with all uppercase characters from the original replaced with their lowercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.lower()
array(['strings 0', 'strings 1', 'strings 2', 'strings 3', 'strings 4'])
lstick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '') Strings[source]

Join the strings from another array onto the left of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (Union[bytes,str_scalars]) – String inserted between self and other

Returns:

The array of joined strings, as other + self

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is neither bytes nor a str or if the other parameter is not a Strings instance

  • RuntimeError – Raised if there is a server-side error thrown

See also

stick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.lstick(t, delimiter='.')
array(['b.a', 'd.c', 'f.e'])
match(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object where elements match only if the beginning of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match only if the beginning of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.match('_+')
<ak.Match object: matched=False; matched=True, span=(0, 4); matched=False;
matched=True, span=(0, 2); matched=False>
objType = 'Strings'
peel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, fromRight: bool = False, regex: bool = False) Tuple[source]

Peel off one or more delimited fields from each string (similar to string.partition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the first (times-1) delimiters

  • includeDelimiter (bool) – If true, append the delimiter to the end of the first return array. By default, it is prepended to the beginning of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the first array. By default, such strings are returned in the second array.

  • fromRight (bool) – If true, peel from the right instead of the left (see also rpeel)

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The field(s) peeled from the end of each string (unless fromRight is true)

right: Strings

The remainder of each string after peeling (unless fromRight is true)

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not byte or str_scalars, if times is not int64, or if includeDelimiter, keepPartial, or fromRight is not bool

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

rpeel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
>>> s.peel('.', includeDelimiter=True)
(array(['a.', 'c.', 'e.']), array(['b', 'd', 'f.g']))
>>> s.peel('.', times=2)
(array(['', '', 'e.f']), array(['a.b', 'c.d', 'g']))
>>> s.peel('.', times=2, keepPartial=True)
(array(['a.b', 'c.d', 'e.f']), array(['', '', 'g']))
pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

purge_cached_regex_patterns() None[source]

purges cached regex patterns

regex_split(pattern: bytes | arkouda.numpy.dtypes.str_scalars, maxsplit: int = 0, return_segments: bool = False) Strings | Tuple[source]

Returns a new Strings split by the occurrences of pattern. If maxsplit is nonzero, at most maxsplit splits occur

Parameters:
  • pattern (str) – Regex used to split strings into substrings

  • maxsplit (int) – The max number of pattern match occurences in each element to split. The default maxsplit=0 splits on all occurences

  • return_segments (bool) – If True, return mapping of original strings to first substring in return array.

Returns:

  • Strings – Substrings with pattern matches removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.split('_+', maxsplit=2, return_segments=True)
(array(['1', '2', '', '', '', '3', '', '4', '5____6___7', '']), array([0 3 5 6 9]))
register(user_defined_name: str) Strings[source]

Register this Strings object with a user defined name in the arkouda server so it can be attached to later using Strings.attach() This is an in-place operation, registering a Strings object more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one object at a time.

Parameters:

user_defined_name (str) – user defined name which the Strings object is to be registered under

Returns:

The same Strings object which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different objects with the same name.

Return type:

Strings

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the Strings object with the user_defined_name If the user is attempting to register more than one object with the same name, the former should be unregistered first to free up the registration name.

See also

attach, unregister

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

registered_name: str | None = None
rpeel(delimiter: bytes | arkouda.numpy.dtypes.str_scalars, times: arkouda.numpy.dtypes.int_scalars = 1, includeDelimiter: bool = False, keepPartial: bool = False, regex: bool = False)[source]

Peel off one or more delimited fields from the end of each string (similar to string.rpartition), returning two new arrays of strings. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • delimiter (Union[bytes, str_scalars]) – The separator where the split will occur

  • times (Union[int, np.int64]) – The number of times the delimiter is sought, i.e. skip over the last (times-1) delimiters

  • includeDelimiter (bool) – If true, prepend the delimiter to the start of the first return array. By default, it is appended to the end of the second return array.

  • keepPartial (bool) – If true, a string that does not contain <times> instances of the delimiter will be returned in the second array. By default, such strings are returned in the first array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

left: Strings

The remainder of the string after peeling

right: Strings

The field(s) that were peeled from the right of each string

Return type:

Tuple[Strings, Strings]

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if times is not int64

  • ValueError – Raised if times is < 1 or if delimiter is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

peel, stick, lstick

Examples

>>> s = ak.array(['a.b', 'c.d', 'e.f.g'])
>>> s.rpeel('.')
(array(['a', 'c', 'e.f']), array(['b', 'd', 'g']))
# Compared against peel
>>> s.peel('.')
(array(['a', 'c', 'e']), array(['b', 'd', 'f.g']))
save(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the Strings object to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: The name of the Strings dataset to be written, defaults to strings_array :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, create a new Strings dataset within existing files.

Parameters:
  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read. This is not supported for Parquet files.

  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Notes

Important implementation notes: (1) Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string, (2) the hdf5 group is named via the dataset parameter. (3) Parquet files do not store the segments, only the values.

search(pattern: bytes | arkouda.numpy.dtypes.str_scalars) arkouda.match.Match[source]

Returns a match object with the first location in each element where pattern produces a match. Elements match if any part of the string matches the regular expression pattern

Parameters:

pattern (str) – Regex used to find matches

Returns:

Match object where elements match if any part of the string matches the regular expression pattern

Return type:

Match

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.search('_+')
<ak.Match object: matched=True, span=(1, 2); matched=True, span=(0, 4);
matched=False; matched=True, span=(0, 2); matched=False>
split(delimiter: str, return_segments: bool = False, regex: bool = False) Strings | Tuple[source]

Unpack delimiter-joined substrings into a flat array.

Parameters:
  • delimiter (str) – Characters used to split strings into substrings

  • return_segments (bool) – If True, also return mapping of original strings to first substring in return array.

  • regex (bool) – Indicates whether delimiter is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

  • Strings – Flattened substrings with delimiters removed

  • pdarray, int64 (optional) – For each original string, the index of first corresponding substring in the return array

See also

peel, rpeel

Examples

>>> orig = ak.array(['one|two', 'three|four|five', 'six'])
>>> orig.flatten('|')
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> flat, map = orig.flatten('|', return_segments=True)
>>> map
array([0, 2, 5])
>>> under = ak.array(['one_two', 'three_____four____five', 'six'])
>>> under_flat, under_map = under.flatten('_+', return_segments=True, regex=True)
>>> under_flat
array(['one', 'two', 'three', 'four', 'five', 'six'])
>>> under_map
array([0, 2, 5])
startswith(substr: bytes | arkouda.numpy.dtypes.str_scalars, regex: bool = False) arkouda.pdarrayclass.pdarray[source]

Check whether each element starts with the given substring.

Parameters:
  • substr (Union[bytes, str_scalars]) – The prefix to search for

  • regex (bool) – Indicates whether substr is a regular expression Note: only handles regular expressions supported by re2 (does not support lookaheads/lookbehinds)

Returns:

True for elements that start with substr, False otherwise

Return type:

pdarray, bool

Raises:
  • TypeError – Raised if the substr parameter is not a bytes ior str_scalars

  • ValueError – Rasied if substr is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings_end = ak.array([f'string {i}' for i in range(1, 6)])
>>> strings_end
array(['string 1', 'string 2', 'string 3', 'string 4', 'string 5'])
>>> strings_end.startswith('string')
array([True, True, True, True, True])
>>> strings_start = ak.array([f'{i} string' for i in range(1,6)])
>>> strings_start
array(['1 string', '2 string', '3 string', '4 string', '5 string'])
>>> strings_start.startswith('\d str', regex = True)
array([True, True, True, True, True])
stick(other: Strings, delimiter: bytes | arkouda.numpy.dtypes.str_scalars = '', toLeft: bool = False) Strings[source]

Join the strings from another array onto one end of the strings of this array, optionally inserting a delimiter. Warning: This function is experimental and not guaranteed to work.

Parameters:
  • other (Strings) – The strings to join onto self’s strings

  • delimiter (str) – String inserted between self and other

  • toLeft (bool) – If true, join other strings to the left of self. By default, other is joined to the right of self.

Returns:

The array of joined strings

Return type:

Strings

Raises:
  • TypeError – Raised if the delimiter parameter is not bytes or str_scalars or if the other parameter is not a Strings instance

  • ValueError – Raised if times is < 1

  • RuntimeError – Raised if there is a server-side error thrown

See also

lstick, peel, rpeel

Examples

>>> s = ak.array(['a', 'c', 'e'])
>>> t = ak.array(['b', 'd', 'f'])
>>> s.stick(t, delimiter='.')
array(['a.b', 'c.d', 'e.f'])
strip(chars: bytes | arkouda.numpy.dtypes.str_scalars | None = '') Strings[source]

Returns a new Strings object with all leading and trailing occurrences of characters contained in chars removed. The chars argument is a string specifying the set of characters to be removed. If omitted, the chars argument defaults to removing whitespace. The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.

Parameters:

chars – the set of characters to be removed

Returns:

Strings object with the leading and trailing characters matching the set of characters in the chars argument removed

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> strings = ak.array(['Strings ', '  StringS  ', 'StringS   '])
>>> s = strings.strip()
>>> s
array(['Strings', 'StringS', 'StringS'])
>>> strings = ak.array(['Strings 1', '1 StringS  ', '  1StringS  12 '])
>>> s = strings.strip(' 12')
>>> s
array(['Strings', 'StringS', 'StringS'])
sub(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Strings[source]

Return new Strings obtained by replacing non-overlapping occurrences of pattern with the replacement repl. If count is nonzero, at most count substitutions occur

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

Strings with pattern matches replaced

Return type:

Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.subn

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.sub(pattern='_+', repl='-', count=2)
array(['1-2-', '-', '3', '-4-5____6___7', ''])
subn(pattern: bytes | arkouda.numpy.dtypes.str_scalars, repl: bytes | arkouda.numpy.dtypes.str_scalars, count: int = 0) Tuple[source]

Perform the same operation as sub(), but return a tuple (new_Strings, number_of_substitions)

Parameters:
  • pattern (str_scalars) – The regex to substitue

  • repl (str_scalars) – The substring to replace pattern matches with

  • count (int) – The max number of pattern match occurences in each element to replace. The default count=0 replaces all occurences of pattern with repl

Returns:

  • Strings – Strings with pattern matches replaced

  • pdarray, int64 – The number of substitutions made for each element of Strings

Raises:
  • TypeError – Raised if pattern or repl are not bytes or str_scalars

  • ValueError – Raised if pattern is not a valid regex

  • RuntimeError – Raised if there is a server-side error thrown

See also

Strings.sub

Examples

>>> strings = ak.array(['1_2___', '____', '3', '__4___5____6___7', ''])
>>> strings.subn(pattern='_+', repl='-', count=2)
(array(['1-2-', '-', '3', '-4-5____6___7', '']), array([2 1 0 2 0]))
title() Strings[source]

Returns a new Strings from the original replaced with their titlecase equivalent.

Returns:

Strings from the original replaced with their titlecase equivalent.

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown.

See also

Strings.lower, String.upper

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.title()
array(['Strings 0', 'Strings 1', 'Strings 2', 'Strings 3', 'Strings 4'])
to_csv(prefix_path: str, dataset: str = 'strings_array', col_delim: str = ',', overwrite: bool = False)[source]

Write Strings to CSV file(s). File will contain a single column with the Strings data. All CSV Files written by Arkouda include a header denoting data types of the columns. Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

Parameters:
  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • dataset (str) – Column name to save the Strings under. Defaults to “strings_array”.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

str reponse message

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

to_hdf(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', save_offsets: bool = True, file_type: str = 'distribute') str[source]

Save the Strings object to HDF5. The object can be saved to a collection of files or single file.

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – The name of the Strings dataset to be written, defaults to strings_array

  • mode (str {'truncate' | 'append'}) – By default, truncate (overwrite) output files, if they exist. If ‘append’, create a new Strings dataset within existing files.

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • file_type (str ("single" | "distribute")) – Default: Distribute Distribute the dataset over a file per locale. Single file will save the dataset to one file

Return type:

String message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • Parquet files do not store the segments, only the values.

  • Strings state is saved as two datasets within an hdf5 group: one for the string characters and one for the segments corresponding to the start of each string

  • the hdf5 group is named via the dataset parameter.

  • The prefix_path must be visible to the arkouda server and the user must have write permission.

  • Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path.

  • If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result.

  • Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

See also

to_hdf

to_list() list[source]

Convert the SegString to a list, transferring data from the arkouda server to Python. If the SegString exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A list with the same strings as this SegString

Return type:

list

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_list()
['hello', 'my', 'world']
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the arkouda server to Python. If the array exceeds a built-in size limit, a RuntimeError is raised.

Returns:

A numpy ndarray with the same strings as this array

Return type:

np.ndarray

Notes

The number of bytes in the array cannot exceed ak.client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting ak.client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.array(["hello", "my", "world"])
>>> a.to_ndarray()
array(['hello', 'my', 'world'], dtype='<U5')
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'strings_array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the Strings object to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a Strings object to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the Strings object is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a Strings object in the arkouda server which was previously registered using register() and/or attached to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

See also

register, attach

Notes

Registered names/Strings objects in the server are immune to deletion until they are unregistered.

static unregister_strings_by_name(user_defined_name: str) None[source]

Unregister a Strings object in the arkouda server previously registered via register()

Parameters:

user_defined_name (str) – The registered name of the Strings object

update_hdf(prefix_path: str, dataset: str = 'strings_array', save_offsets: bool = True, repack: bool = True)[source]

Overwrite the dataset with the name provided with this Strings object. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • save_offsets (bool) – Defaults to True which will instruct the server to save the offsets array to HDF5 If False the offsets array will not be save and will be derived from the string values upon load/read.

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the Strings object

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

upper() Strings[source]

Returns a new Strings with all lowercase characters from the original replaced with their uppercase equivalent

Returns:

Strings with all lowercase characters from the original replaced with their uppercase equivalent

Return type:

Strings

Raises:

RuntimeError – Raised if there is a server-side error thrown

See also

Strings.lower

Examples

>>> strings = ak.array([f'StrINgS {i}' for i in range(5)])
>>> strings
array(['StrINgS 0', 'StrINgS 1', 'StrINgS 2', 'StrINgS 3', 'StrINgS 4'])
>>> strings.upper()
array(['STRINGS 0', 'STRINGS 1', 'STRINGS 2', 'STRINGS 3', 'STRINGS 4'])
class arkouda.TimeDelta64DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.Timedelta(pda, unit: str = _BASE_UNIT)[source]

Bases: _AbstractBaseTime

Represents a duration, the difference between two dates or times.

Timedelta is the Arkouda equivalent of pandas.TimedeltaIndex.

Parameters:
  • pda (int64 pdarray, pd.TimedeltaIndex, pd.Series, or np.timedelta64 array)

  • unit (str, default 'ns') –

    For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.

    Possible values:

    • ’weeks’ or ‘w’

    • ’days’ or ‘d’

    • ’hours’ or ‘h’

    • ’minutes’, ‘m’, or ‘t’

    • ’seconds’ or ‘s’

    • ’milliseconds’, ‘ms’, or ‘l’

    • ’microseconds’, ‘us’, or ‘u’

    • ’nanoseconds’, ‘ns’, or ‘n’

    Unlike in pandas, units cannot be combined or mixed with integers

Notes

The .values attribute is always in nanoseconds with int64 dtype.

abs()[source]

Absolute value of time interval.

property components
property days
is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property microseconds
property nanoseconds
register(user_defined_name)[source]

Register this Timedelta object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the timedelta is to be registered under, this will be the root name for underlying components

Returns:

The same Timedelta which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Timedeltas with the same name.

Return type:

Timedelta

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the timedelta with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property seconds
special_objType = 'Timedelta'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0)[source]

Returns the standard deviation as a pd.Timedelta object

sum()[source]

Return the sum of all elements in the array.

supported_opeq
supported_with_datetime
supported_with_pdarray
supported_with_r_datetime
supported_with_r_pdarray
supported_with_r_timedelta
supported_with_timedelta
to_pandas()[source]

Convert array to a pandas TimedeltaIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.

See also

to_ndarray

total_seconds()[source]
unregister()[source]

Unregister this timedelta object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

class arkouda.Timedelta(pda, unit: str = _BASE_UNIT)[source]

Bases: _AbstractBaseTime

Represents a duration, the difference between two dates or times.

Timedelta is the Arkouda equivalent of pandas.TimedeltaIndex.

Parameters:
  • pda (int64 pdarray, pd.TimedeltaIndex, pd.Series, or np.timedelta64 array)

  • unit (str, default 'ns') –

    For int64 pdarray, denotes the unit of the input. Ignored for pandas and numpy arrays, which carry their own unit. Not case-sensitive; prefixes of full names (like ‘sec’) are accepted.

    Possible values:

    • ’weeks’ or ‘w’

    • ’days’ or ‘d’

    • ’hours’ or ‘h’

    • ’minutes’, ‘m’, or ‘t’

    • ’seconds’ or ‘s’

    • ’milliseconds’, ‘ms’, or ‘l’

    • ’microseconds’, ‘us’, or ‘u’

    • ’nanoseconds’, ‘ns’, or ‘n’

    Unlike in pandas, units cannot be combined or mixed with integers

Notes

The .values attribute is always in nanoseconds with int64 dtype.

abs()[source]

Absolute value of time interval.

property components
property days
is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry or is a component of a registered object.

Returns:

Indicates if the object is contained in the registry

Return type:

numpy.bool

Raises:

RegistrationError – Raised if there’s a server-side error or a mis-match of registered components

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property microseconds
property nanoseconds
register(user_defined_name)[source]

Register this Timedelta object and underlying components with the Arkouda server

Parameters:

user_defined_name (str) – user defined name the timedelta is to be registered under, this will be the root name for underlying components

Returns:

The same Timedelta which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different Timedeltas with the same name.

Return type:

Timedelta

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the timedelta with the user_defined_name

Notes

Objects registered with the server are immune to deletion until they are unregistered.

property seconds
special_objType = 'Timedelta'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0)[source]

Returns the standard deviation as a pd.Timedelta object

sum()[source]

Return the sum of all elements in the array.

supported_opeq
supported_with_datetime
supported_with_pdarray
supported_with_r_datetime
supported_with_r_pdarray
supported_with_r_timedelta
supported_with_timedelta
to_pandas()[source]

Convert array to a pandas TimedeltaIndex. Note: if the array size exceeds client.maxTransferBytes, a RuntimeError is raised.

See also

to_ndarray

total_seconds()[source]
unregister()[source]

Unregister this timedelta object in the arkouda server which was previously registered using register() and/or attached to using attach()

Raises:

RegistrationError – If the object is already unregistered or if there is a server error when attempting to unregister

Notes

Objects registered with the server are immune to deletion until they are unregistered.

class arkouda.TooHardError[source]

max_work was exceeded.

This is raised whenever the maximum number of candidate solutions to consider specified by the max_work parameter is exceeded. Assigning a finite number to max_work may have caused the operation to fail.

class arkouda.True_(value)

Bases: numpy.generic

Boolean type (True or False), stored as a byte.

Warning

The bool_ type is not a subclass of the int_ type (the bool_ is not even a number type). This is different than Python’s default implementation of bool as a sub-class of int.

Character code:

'?'

class arkouda.UByteDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.UInt16DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.UInt32DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.UInt64DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.UInt8DType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.UIntDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.ULongDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.ULongLongDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

class arkouda.UShortDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

arkouda.VAL_SUFFIX = '_values'
class arkouda.VoidDType(obj, align=False, copy=False)

Bases: numpy.dtype

DType class corresponding to the scalar type and dtype of the same name.

Please see numpy.dtype for the typical way to create dtype instances and arrays.dtypes for additional information.

arkouda.abs(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise absolute value of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing absolute values of the input array elements

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.abs(ak.arange(-5,-1))
array([5, 4, 3, 2])
>>> ak.abs(ak.linspace(-5,-1,5))
array([5, 4, 3, 2, 1])
arkouda.add_newdoc(place, obj, doc, warn_on_python=True)[source]

Add documentation to an existing object, typically one defined in C

The purpose is to allow easier editing of the docstrings without requiring a re-compile. This exists primarily for internal use within numpy itself.

Parameters:
  • place (str) – The absolute name of the module to import from

  • obj (str) – The name of the object to add documentation to, typically a class or function name

  • doc ({str, Tuple[str, str], List[Tuple[str, str]]}) –

    If a string, the documentation to apply to obj

    If a tuple, then the first element is interpreted as an attribute of obj and the second as the docstring to apply - (method, docstring)

    If a list, then each element of the list should be a tuple of length two - [(method1, docstring1), (method2, docstring2), ...]

  • warn_on_python (bool) – If True, the default, emit UserWarning if this is used to attach documentation to a pure-python object.

Notes

This routine never raises an error if the docstring can’t be written, but will raise an error if the object being documented does not exist.

This routine cannot modify read-only docstrings, as appear in new-style classes or built-in functions. Because this routine never raises an error the caller must check manually that the docstrings were changed.

Since this function grabs the char * from a c-level str object and puts it into the tp_doc slot of the type of obj, it violates a number of C-API best-practices, by:

  • modifying a PyTypeObject after calling PyType_Ready

  • calling Py_INCREF on the str and losing the reference, so the str will never be released

If possible it should be avoided.

arkouda.akabs(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray

Return the element-wise absolute value of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing absolute values of the input array elements

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.abs(ak.arange(-5,-1))
array([5, 4, 3, 2])
>>> ak.abs(ak.linspace(-5,-1,5))
array([5, 4, 3, 2, 1])
class arkouda.akbool(value)

Bases: numpy.generic

Boolean type (True or False), stored as a byte.

Warning

The bool_ type is not a subclass of the int_ type (the bool_ is not even a number type). This is different than Python’s default implementation of bool as a sub-class of int.

Character code:

'?'

class arkouda.akbool(value)

Bases: numpy.generic

Boolean type (True or False), stored as a byte.

Warning

The bool_ type is not a subclass of the int_ type (the bool_ is not even a number type). This is different than Python’s default implementation of bool as a sub-class of int.

Character code:

'?'

arkouda.akcast(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical, dt: numpy.dtype | type | str | dtypes.dtypes.bigint, errors: _numeric.ErrorMode = ErrorMode.strict) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical | Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]

Cast an array to another dtype.

Parameters:
  • pda (pdarray or Strings) – The array of values to cast

  • dt (np.dtype, type, or str) – The target dtype to cast values to

  • errors ({strict, ignore, return_validity}) –

    Controls how errors are handled when casting strings to a numeric type (ignored for casts from numeric types).

    • strict: raise RuntimeError if any string cannot be converted

    • ignore: never raise an error. Uninterpretable strings get

      converted to NaN (float64), -2**63 (int64), zero (uint64 and uint8), or False (bool)

    • return_validity: in addition to returning the same output as “ignore”, also return a bool array indicating where the cast was successful.

Returns:

  • pdarray or Strings – Array of values cast to desired dtype

  • [validity (pdarray(bool)]) – If errors=”return_validity” and input is Strings, a second array is returned with True where the cast succeeded and False where it failed.

Notes

The cast is performed according to Chapel’s casting rules and is NOT safe from overflows or underflows. The user must ensure that the target dtype has the precision and capacity to hold the desired result.

Examples

>>> ak.cast(ak.linspace(1.0,5.0,5), dt=ak.int64)
array([1, 2, 3, 4, 5])
>>> ak.cast(ak.arange(0,5), dt=ak.float64).dtype
dtype('float64')
>>> ak.cast(ak.arange(0,5), dt=ak.bool_)
array([False, True, True, True, True])
>>> ak.cast(ak.linspace(0,4,5), dt=ak.bool_)
array([False, True, True, True, True])
arkouda.akcast(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical, dt: numpy.dtype | type | str | dtypes.dtypes.bigint, errors: _numeric.ErrorMode = ErrorMode.strict) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical | Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray]

Cast an array to another dtype.

Parameters:
  • pda (pdarray or Strings) – The array of values to cast

  • dt (np.dtype, type, or str) – The target dtype to cast values to

  • errors ({strict, ignore, return_validity}) –

    Controls how errors are handled when casting strings to a numeric type (ignored for casts from numeric types).

    • strict: raise RuntimeError if any string cannot be converted

    • ignore: never raise an error. Uninterpretable strings get

      converted to NaN (float64), -2**63 (int64), zero (uint64 and uint8), or False (bool)

    • return_validity: in addition to returning the same output as “ignore”, also return a bool array indicating where the cast was successful.

Returns:

  • pdarray or Strings – Array of values cast to desired dtype

  • [validity (pdarray(bool)]) – If errors=”return_validity” and input is Strings, a second array is returned with True where the cast succeeded and False where it failed.

Notes

The cast is performed according to Chapel’s casting rules and is NOT safe from overflows or underflows. The user must ensure that the target dtype has the precision and capacity to hold the desired result.

Examples

>>> ak.cast(ak.linspace(1.0,5.0,5), dt=ak.int64)
array([1, 2, 3, 4, 5])
>>> ak.cast(ak.arange(0,5), dt=ak.float64).dtype
dtype('float64')
>>> ak.cast(ak.arange(0,5), dt=ak.bool_)
array([False, True, True, True, True])
>>> ak.cast(ak.linspace(0,4,5), dt=ak.bool_)
array([False, True, True, True, True])
class arkouda.akfloat64(value)

Bases: numpy.floating

Double-precision floating-point number type, compatible with Python float

and C double.

Character code:

'd'

Canonical name:

numpy.double

Alias:

numpy.float_

Alias on this platform (Linux x86_64):

numpy.float64: 64-bit precision floating-point number type: sign bit, 11 bits exponent, 52 bits mantissa.

as_integer_ratio(*args, **kwargs)

double.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.double(10.0).as_integer_ratio()
(10, 1)
>>> np.double(0.0).as_integer_ratio()
(0, 1)
>>> np.double(-.25).as_integer_ratio()
(-1, 4)
fromhex(string, /)

Create a floating-point number from a hexadecimal string.

>>> float.fromhex('0x1.ffffp10')
2047.984375
>>> float.fromhex('-0x1p-1074')
-5e-324
hex(/)

Return a hexadecimal representation of a floating-point number.

>>> (-0.1).hex()
'-0x1.999999999999ap-4'
>>> 3.14159.hex()
'0x1.921f9f01b866ep+1'
is_integer(*args, **kwargs)

double.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.double(-2.0).is_integer()
True
>>> np.double(3.2).is_integer()
False
class arkouda.akfloat64(value)

Bases: numpy.floating

Double-precision floating-point number type, compatible with Python float

and C double.

Character code:

'd'

Canonical name:

numpy.double

Alias:

numpy.float_

Alias on this platform (Linux x86_64):

numpy.float64: 64-bit precision floating-point number type: sign bit, 11 bits exponent, 52 bits mantissa.

as_integer_ratio(*args, **kwargs)

double.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.double(10.0).as_integer_ratio()
(10, 1)
>>> np.double(0.0).as_integer_ratio()
(0, 1)
>>> np.double(-.25).as_integer_ratio()
(-1, 4)
fromhex(string, /)

Create a floating-point number from a hexadecimal string.

>>> float.fromhex('0x1.ffffp10')
2047.984375
>>> float.fromhex('-0x1p-1074')
-5e-324
hex(/)

Return a hexadecimal representation of a floating-point number.

>>> (-0.1).hex()
'-0x1.999999999999ap-4'
>>> 3.14159.hex()
'0x1.921f9f01b866ep+1'
is_integer(*args, **kwargs)

double.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.double(-2.0).is_integer()
True
>>> np.double(3.2).is_integer()
False
class arkouda.akint64(value)

Bases: numpy.signedinteger

Signed integer type, compatible with Python int and C long.

Character code:

'l'

Canonical name:

numpy.int_

Alias on this platform (Linux x86_64):

numpy.int64: 64-bit signed integer (-9_223_372_036_854_775_808 to 9_223_372_036_854_775_807).

Alias on this platform (Linux x86_64):

numpy.intp: Signed integer large enough to fit pointer, compatible with C intptr_t.

bit_count(*args, **kwargs)

int64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int64(127).bit_count()
7
>>> np.int64(-127).bit_count()
7
class arkouda.akint64(value)

Bases: numpy.signedinteger

Signed integer type, compatible with Python int and C long.

Character code:

'l'

Canonical name:

numpy.int_

Alias on this platform (Linux x86_64):

numpy.int64: 64-bit signed integer (-9_223_372_036_854_775_808 to 9_223_372_036_854_775_807).

Alias on this platform (Linux x86_64):

numpy.intp: Signed integer large enough to fit pointer, compatible with C intptr_t.

bit_count(*args, **kwargs)

int64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int64(127).bit_count()
7
>>> np.int64(-127).bit_count()
7
class arkouda.akint64(value)

Bases: numpy.signedinteger

Signed integer type, compatible with Python int and C long.

Character code:

'l'

Canonical name:

numpy.int_

Alias on this platform (Linux x86_64):

numpy.int64: 64-bit signed integer (-9_223_372_036_854_775_808 to 9_223_372_036_854_775_807).

Alias on this platform (Linux x86_64):

numpy.intp: Signed integer large enough to fit pointer, compatible with C intptr_t.

bit_count(*args, **kwargs)

int64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int64(127).bit_count()
7
>>> np.int64(-127).bit_count()
7
class arkouda.akuint64(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned long.

Character code:

'L'

Canonical name:

numpy.uint

Alias on this platform (Linux x86_64):

numpy.uint64: 64-bit unsigned integer (0 to 18_446_744_073_709_551_615).

Alias on this platform (Linux x86_64):

numpy.uintp: Unsigned integer large enough to fit pointer, compatible with C uintptr_t.

bit_count(*args, **kwargs)

uint64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint64(127).bit_count()
7
class arkouda.akuint64(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned long.

Character code:

'L'

Canonical name:

numpy.uint

Alias on this platform (Linux x86_64):

numpy.uint64: 64-bit unsigned integer (0 to 18_446_744_073_709_551_615).

Alias on this platform (Linux x86_64):

numpy.uintp: Unsigned integer large enough to fit pointer, compatible with C uintptr_t.

bit_count(*args, **kwargs)

uint64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint64(127).bit_count()
7
class arkouda.akuint64(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned long.

Character code:

'L'

Canonical name:

numpy.uint

Alias on this platform (Linux x86_64):

numpy.uint64: 64-bit unsigned integer (0 to 18_446_744_073_709_551_615).

Alias on this platform (Linux x86_64):

numpy.uintp: Unsigned integer large enough to fit pointer, compatible with C uintptr_t.

bit_count(*args, **kwargs)

uint64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint64(127).bit_count()
7
arkouda.align(*args)[source]

Map multiple arrays of sparse identifiers to a common 0-up index.

Parameters:

*args (pdarrays or sequences of pdarrays) – Arrays to map to dense index

Returns:

aligned – Arrays with values replaced by 0-up indices

Return type:

list of pdarrays

class arkouda.all_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

arkouda.arccos(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise inverse cosine of the array. The result is between 0 and pi.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the inverse cosine will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing inverse cosine for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.arccosh(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise inverse hyperbolic cosine of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the inverse hyperbolic cosine will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing inverse hyperbolic cosine for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.arcsin(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise inverse sine of the array. The result is between -pi/2 and pi/2.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the inverse sine will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing inverse sine for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.arcsinh(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise inverse hyperbolic sine of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the inverse hyperbolic sine will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing inverse hyperbolic sine for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.arctan(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise inverse tangent of the array. The result is between -pi/2 and pi/2.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the inverse tangent will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing inverse tangent for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.arctan2(num: arkouda.pdarrayclass.pdarray | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64, denom: arkouda.pdarrayclass.pdarray | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise inverse tangent of the array pair. The result chosen is the signed angle in radians between the ray ending at the origin and passing through the point (1,0), and the ray ending at the origin and passing through the point (denom, num). The result is between -pi and pi.

Parameters:
  • num (Union[numeric_scalars, pdarray]) – Numerator of the arctan2 argument.

  • denom (Union[numeric_scalars, pdarray]) – Denominator of the arctan2 argument.

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the inverse tangent will be applied to the corresponding values. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing inverse tangent for each corresponding element pair of the original pdarray, using the signed values or the numerator and denominator to get proper placement on unit circle.

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.arctanh(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise inverse hyperbolic tangent of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the inverse hyperbolic tangent will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing inverse hyperbolic tangent for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameters are not a pdarray or numeric scalar.

arkouda.argmaxk(pda: pdarray, k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Find the indices corresponding to the k maximum values of an array.

Returns the largest k values of an array, sorted

Parameters:
  • pda (pdarray) – Input array.

  • k (int_scalars) – The desired count of indices corresponding to maxmum array values

Returns:

The indices of the maximum k values from the pda, sorted

Return type:

pdarray, int

Raises:
  • TypeError – Raised if pda is not a pdarray or k is not an integer

  • ValueError – Raised if the pda is empty or k < 1

Notes

This call is equivalent in value to:

ak.argsort(a)[k:]

and generally outperforms this operation.

This reduction will see a significant drop in performance as k grows beyond a certain value. This value is system dependent, but generally about a k of 5 million is where performance degradation has been observed.

Examples

>>> A = ak.array([10,5,1,3,7,2,9,0])
>>> ak.argmaxk(A, 3)
array([4, 6, 0])
>>> ak.argmaxk(A, 4)
array([1, 4, 6, 0])
arkouda.argmink(pda: pdarray, k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Finds the indices corresponding to the k minimum values of an array.

Parameters:
  • pda (pdarray) – Input array.

  • k (int_scalars) – The desired count of indices corresponding to minimum array values

Returns:

The indices of the minimum k values from the pda, sorted

Return type:

pdarray, int

Raises:
  • TypeError – Raised if pda is not a pdarray or k is not an integer

  • ValueError – Raised if the pda is empty or k < 1

Notes

This call is equivalent in value to:

ak.argsort(a)[:k]

and generally outperforms this operation.

This reduction will see a significant drop in performance as k grows beyond a certain value. This value is system dependent, but generally about a k of 5 million is where performance degradation has been observed.

Examples

>>> A = ak.array([10,5,1,3,7,2,9,0])
>>> ak.argmink(A, 3)
array([7, 2, 5])
>>> ak.argmink(A, 4)
array([7, 2, 5, 3])
arkouda.argsort(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD, axis: arkouda.numpy.dtypes.int_scalars = 0) arkouda.pdarrayclass.pdarray[source]

Return the permutation that sorts the array.

Parameters:

pda (pdarray or Strings or Categorical) – The array to sort (int64, uint64, or float64)

Returns:

The indices such that pda[indices] is sorted

Return type:

pdarray, int64

Raises:

TypeError – Raised if the parameter is other than a pdarray or Strings

See also

coargsort

Notes

Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive.

Examples

>>> a = ak.randint(0, 10, 10)
>>> perm = ak.argsort(a)
>>> a[perm]
array([0, 1, 1, 3, 4, 5, 7, 8, 8, 9])
arkouda.argsort(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD, axis: arkouda.numpy.dtypes.int_scalars = 0) arkouda.pdarrayclass.pdarray[source]

Return the permutation that sorts the array.

Parameters:

pda (pdarray or Strings or Categorical) – The array to sort (int64, uint64, or float64)

Returns:

The indices such that pda[indices] is sorted

Return type:

pdarray, int64

Raises:

TypeError – Raised if the parameter is other than a pdarray or Strings

See also

coargsort

Notes

Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive.

Examples

>>> a = ak.randint(0, 10, 10)
>>> perm = ak.argsort(a)
>>> a[perm]
array([0, 1, 1, 3, 4, 5, 7, 8, 8, 9])
arkouda.argsort(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD, axis: arkouda.numpy.dtypes.int_scalars = 0) arkouda.pdarrayclass.pdarray[source]

Return the permutation that sorts the array.

Parameters:

pda (pdarray or Strings or Categorical) – The array to sort (int64, uint64, or float64)

Returns:

The indices such that pda[indices] is sorted

Return type:

pdarray, int64

Raises:

TypeError – Raised if the parameter is other than a pdarray or Strings

See also

coargsort

Notes

Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive.

Examples

>>> a = ak.randint(0, 10, 10)
>>> perm = ak.argsort(a)
>>> a[perm]
array([0, 1, 1, 3, 4, 5, 7, 8, 8, 9])
arkouda.array_equal(pda_a: arkouda.pdarrayclass.pdarray, pda_b: arkouda.pdarrayclass.pdarray, equal_nan: bool = False)[source]

Compares two pdarrays for equality. If neither array has any nan elements, then if all elements are pairwise equal, it returns True. If equal_Nan is False, then any nan element in either array gives a False return. If equal_Nan is True, then pairwise-corresponding nans are considered equal.

Parameters:
  • pda_a (pdarray)

  • pda_b (pdarray)

  • equal_nan (boolean to determine how to handle nans, default False)

Returns:

With string data:

False if one array is type ak.str_ & the other isn’t, True if both are ak.str_ & they match.

With numeric data:

True if neither array has any nan elements, and all elements pairwise equal.

True if equal_Nan True, all non-nans pairwise equal & nans in pda_a correspond to nans in pda_b

False if equal_Nan False, & either array has any nan element.

Return type:

boolean

Examples

>>> a = ak.randint(0,10,10,dtype=ak.float64)
>>> b = a
>>> ak.array_equal(a,b)
True
>>> b[9] = np.nan
>>> ak.array_equal(a,b)
False
>>> a[9] = np.nan
>>> ak.array_equal(a,b)
False
>>> ak.array_equal(a,b,True)
True
arkouda.assert_almost_equal(left, right, rtol: float = 1e-05, atol: float = 1e-08, **kwargs) None[source]

Check that the left and right objects are approximately equal.

By approximately equal, we refer to objects that are numbers or that contain numbers which may be equivalent to specific levels of precision.

Parameters:
  • left (object)

  • right (object)

  • rtol (float, default 1e-5) – Relative tolerance.

  • atol (float, default 1e-8) – Absolute tolerance.

Warning

This function cannot be used on pdarray of size > ak.client.maxTransferBytes because it converts pdarrays to numpy arrays and calls np.allclose.

arkouda.assert_almost_equivalent(left, right, rtol: float = 1e-05, atol: float = 1e-08) None[source]

Check that the left and right objects are approximately equal.

By approximately equal, we refer to objects that are numbers or that contain numbers which may be equivalent to specific levels of precision.

If the objects are pandas or numpy objects, they are converted to arkouda objects. Then assert_almost_equal is applied to the result.

Parameters:
  • left (object)

  • right (object)

  • rtol (float, default 1e-5) – Relative tolerance.

  • atol (float, default 1e-8) – Absolute tolerance.

Warning

This function cannot be used on pdarray of size > ak.client.maxTransferBytes because it converts pdarrays to numpy arrays and calls np.allclose.

arkouda.assert_arkouda_array_equal(left: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray, right: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]

Check that ‘ak.pdarray’ or ‘ak.Strings’, ‘ak.Categorical’, or ‘ak.SegArray’ is equivalent.

Parameters:
  • left (arkouda.pdarray or arkouda.Strings or arkouda.Categorical or arkouda.SegArray) – The two arrays to be compared.

  • right (arkouda.pdarray or arkouda.Strings or arkouda.Categorical or arkouda.SegArray) – The two arrays to be compared.

  • check_dtype (bool, default True) – Check dtype if both a and b are ak.pdarray.

  • err_msg (str, default None) – If provided, used as assertion message.

  • check_same (None|'copy'|'same', default None) – Ensure left and right refer/do not refer to the same memory area.

  • obj (str, default 'numpy array') – Specify object name being compared, internally used to show appropriate assertion message.

  • index_values (Index | arkouda.pdarray, default None) – optional index (shared by both left and right), used in output.

arkouda.assert_arkouda_array_equivalent(left: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray | numpy.ndarray | pandas.Categorical, right: arkouda.pdarray | arkouda.Strings | arkouda.Categorical | arkouda.SegArray | numpy.ndarray | pandas.Categorical, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]

Check that ‘np.array’, ‘pd.Categorical’, ‘ak.pdarray’, ‘ak.Strings’, ‘ak.Categorical’, or ‘ak.SegArray’ is equivalent.

np.nparray’s and pd.Categorical’s will be converted to the arkouda equivalent. Then assert_arkouda_pdarray_equal will be applied to the result.

Parameters:
  • left (np.ndarray, pd.Categorical, arkouda.pdarray or arkouda.Strings or arkouda.Categorical) – The two arrays to be compared.

  • right (np.ndarray, pd.Categorical, arkouda.pdarray or arkouda.Strings or arkouda.Categorical) – The two arrays to be compared.

  • check_dtype (bool, default True) – Check dtype if both a and b are ak.pdarray or np.ndarray.

  • err_msg (str, default None) – If provided, used as assertion message.

  • check_same (None|'copy'|'same', default None) – Ensure left and right refer/do not refer to the same memory area.

  • obj (str, default 'numpy array') – Specify object name being compared, internally used to show appropriate assertion message.

  • index_values (Index | arkouda.pdarray, default None) – optional index (shared by both left and right), used in output.

arkouda.assert_arkouda_pdarray_equal(left: arkouda.pdarray, right: arkouda.pdarray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'pdarray', index_values=None) None[source]

Check that the two ‘ak.pdarray’s are equivalent.

Parameters:
  • left (arkouda.pdarray) – The two arrays to be compared.

  • right (arkouda.pdarray) – The two arrays to be compared.

  • check_dtype (bool, default True) – Check dtype if both a and b are ak.pdarray.

  • err_msg (str, default None) – If provided, used as assertion message.

  • check_same (None|'copy'|'same', default None) – Ensure left and right refer/do not refer to the same memory area.

  • obj (str, default 'pdarray') – Specify object name being compared, internally used to show appropriate assertion message.

  • index_values (Index | arkouda.pdarray, default None) – optional index (shared by both left and right), used in output.

arkouda.assert_arkouda_segarray_equal(left: arkouda.SegArray, right: arkouda.SegArray, check_dtype: bool = True, err_msg=None, check_same=None, obj: str = 'segarray') None[source]

Check that the two ‘ak.segarray’s are equivalent.

Parameters:
  • left (arkouda.segarray) – The two segarrays to be compared.

  • right (arkouda.segarray) – The two segarrays to be compared.

  • check_dtype (bool, default True) – Check dtype if both a and b are ak.pdarray.

  • err_msg (str, default None) – If provided, used as assertion message.

  • check_same (None|'copy'|'same', default None) – Ensure left and right refer/do not refer to the same memory area.

  • obj (str, default 'pdarray') – Specify object name being compared, internally used to show appropriate assertion message.

arkouda.assert_arkouda_strings_equal(left, right, err_msg=None, check_same=None, obj: str = 'Strings', index_values=None) None[source]

Check that ‘ak.Strings’ is equivalent.

Parameters:
  • left (arkouda.Strings) – The two Strings to be compared.

  • right (arkouda.Strings) – The two Strings to be compared.

  • err_msg (str, default None) – If provided, used as assertion message.

  • check_same (None|'copy'|'same', default None) – Ensure left and right refer/do not refer to the same memory area.

  • obj (str, default 'Strings') – Specify object name being compared, internally used to show appropriate assertion message.

  • index_values (Index | arkouda.pdarray, default None) – optional index (shared by both left and right), used in output.

arkouda.assert_attr_equal(attr: str, left, right, obj: str = 'Attributes') None[source]

Check attributes are equal. Both objects must have attribute.

Parameters:
  • attr (str) – Attribute name being compared.

  • left (object)

  • right (object)

  • obj (str, default 'Attributes') – Specify object name being compared, internally used to show appropriate assertion message

arkouda.assert_categorical_equal(left, right, check_dtype: bool = True, check_category_order: bool = True, obj: str = 'Categorical') None[source]

Test that Categoricals are equivalent.

Parameters:
  • left (Categorical)

  • right (Categorical)

  • check_dtype (bool, default True) – Check that integer dtype of the codes are the same.

  • check_category_order (bool, default True) – Whether the order of the categories should be compared, which implies identical integer codes. If False, only the resulting values are compared. The ordered attribute is checked regardless.

  • obj (str, default 'Categorical') – Specify object name being compared, internally used to show appropriate assertion message.

arkouda.assert_class_equal(left, right, exact: bool = True, obj: str = 'Input') None[source]

Checks classes are equal.

arkouda.assert_contains_all(iterable, dic) None[source]

Assert that a dictionary contains all the elements of an iterable. :param iterable: :type iterable: iterable :param dic: :type dic: dict

arkouda.assert_copy(iter1, iter2, **eql_kwargs) None[source]

Checks that the elements are equal, but not the same object. (Does not check that items in sequences are also not the same object.)

Parameters:
  • iter1 (iterable) – Iterables that produce elements comparable with assert_almost_equal.

  • iter2 (iterable) – Iterables that produce elements comparable with assert_almost_equal.

arkouda.assert_dict_equal(left, right, compare_keys: bool = True) None[source]

Assert that two dictionaries are equal. Values must be arkouda objects. :param left: The dictionaries to be compared. :type left: dict :param right: The dictionaries to be compared. :type right: dict :param compare_keys: Whether to compare the keys.

If False, only the values are compared.

arkouda.assert_equal(left, right, **kwargs) None[source]

Wrapper for tm.assert_*_equal to dispatch to the appropriate test function.

Parameters:
arkouda.assert_equivalent(left, right, **kwargs) None[source]

Wrapper for tm.assert_*_equivalent to dispatch to the appropriate test function.

Parameters:
  • left (Index, pd.Index, Series, pd.Series, DataFrame, pd.DataFrame,)

  • right (Index, pd.Index, Series, pd.Series, DataFrame, pd.DataFrame,)

  • Strings – The two items to be compared.

  • Categorical – The two items to be compared.

  • pd.Categorical – The two items to be compared.

  • SegArray – The two items to be compared.

  • pdarray – The two items to be compared.

  • np.ndarray – The two items to be compared.

:param : The two items to be compared. :param **kwargs: All keyword arguments are passed through to the underlying assert method.

arkouda.assert_frame_equal(left: arkouda.DataFrame, right: arkouda.DataFrame, check_dtype: bool = True, check_index_type: bool = True, check_column_type: bool = True, check_frame_type: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_like: bool = False, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'DataFrame') None[source]

Check that left and right DataFrame are equal.

This function is intended to compare two DataFrames and output any differences. It is mostly intended for use in unit tests. Additional parameters allow varying the strictness of the equality checks performed.

Parameters:
  • left (DataFrame) – First DataFrame to compare.

  • right (DataFrame) – Second DataFrame to compare.

  • check_dtype (bool, default True) – Whether to check the DataFrame dtype is identical.

  • check_index_type (bool, default = True) – Whether to check the Index class, dtype and inferred_type are identical.

  • check_column_type (bool or {'equiv'}, default 'equiv') – Whether to check the columns class, dtype and inferred_type are identical. Is passed as the exact argument of assert_index_equal().

  • check_frame_type (bool, default True) – Whether to check the DataFrame class is identical.

  • check_names (bool, default True) – Whether to check that the names attribute for both the index and column attributes of the DataFrame is identical.

  • check_exact (bool, default False) – Whether to compare number exactly.

  • check_categorical (bool, default True) – Whether to compare internal Categorical exactly.

  • check_like (bool, default False) – If True, ignore the order of index & columns. Note: index labels must match their respective rows (same as in columns) - same labels must be with the same data.

  • rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.

  • atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.

  • obj (str, default 'DataFrame') – Specify object name being compared, internally used to show appropriate assertion message.

See also

assert_series_equal

Equivalent method for asserting Series equality.

Examples

This example shows comparing two DataFrames that are equal but with columns of differing dtypes.

>>> from arkouda.testing import assert_frame_equal
>>> df1 = ak.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df2 = ak.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

df1 equals itself.

>>> assert_frame_equal(df1, df1)

df1 differs from df2 as column ‘b’ is of a different type.

>>> assert_frame_equal(df1, df2)
Traceback (most recent call last):
...
AssertionError: Attributes of DataFrame.iloc[:, 1] (column name="b") are different

Attribute “dtype” are different [left]: int64 [right]: float64

Ignore differing dtypes in columns with check_dtype.

>>> assert_frame_equal(df1, df2, check_dtype=False)
arkouda.assert_frame_equivalent(left: arkouda.DataFrame | pandas.DataFrame, right: arkouda.DataFrame | pandas.DataFrame, check_dtype: bool = True, check_index_type: bool = True, check_column_type: bool = True, check_frame_type: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_like: bool = False, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'DataFrame') None[source]

Check that left and right DataFrame are equal.

This function is intended to compare two DataFrames and output any differences. It is mostly intended for use in unit tests. Additional parameters allow varying the strictness of the equality checks performed.

pd.DataFrame’s will be converted to the arkouda equivalent. Then assert_frame_equal will be applied to the result.

Parameters:
  • left (DataFrame or pd.DataFrame) – First DataFrame to compare.

  • right (DataFrame or pd.DataFrame) – Second DataFrame to compare.

  • check_dtype (bool, default True) – Whether to check the DataFrame dtype is identical.

  • check_index_type (bool, default = True) – Whether to check the Index class, dtype and inferred_type are identical.

  • check_column_type (bool or {'equiv'}, default 'equiv') – Whether to check the columns class, dtype and inferred_type are identical. Is passed as the exact argument of assert_index_equal().

  • check_frame_type (bool, default True) – Whether to check the DataFrame class is identical.

  • check_names (bool, default True) – Whether to check that the names attribute for both the index and column attributes of the DataFrame is identical.

  • check_exact (bool, default False) – Whether to compare number exactly.

  • check_categorical (bool, default True) – Whether to compare internal Categorical exactly.

  • check_like (bool, default False) – If True, ignore the order of index & columns. Note: index labels must match their respective rows (same as in columns) - same labels must be with the same data.

  • rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.

  • atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.

  • obj (str, default 'DataFrame') – Specify object name being compared, internally used to show appropriate assertion message.

Examples

This example shows comparing two DataFrames that are equal but with columns of differing dtypes.

>>> from arkouda.testing import assert_frame_equivalent
>>> import pandas as pd
>>> df1 = ak.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df2 = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
>>> assert_frame_equivalent(df1, df1)
arkouda.assert_index_equal(left: arkouda.Index, right: arkouda.Index, exact: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Index') None[source]

Check that left and right Index are equal.

Parameters:
  • left (Index)

  • right (Index)

  • exact (True) – Whether to check the Index class, dtype and inferred_type are identical.

  • check_names (bool, default True) – Whether to check the names attribute.

  • check_exact (bool, default True) – Whether to compare number exactly.

  • check_categorical (bool, default True) – Whether to compare internal Categorical exactly.

  • check_order (bool, default True) – Whether to compare the order of index entries as well as their values. If True, both indexes must contain the same elements, in the same order. If False, both indexes must contain the same elements, but in any order.

  • rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.

  • atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.

  • obj (str, default 'Index') – Specify object name being compared, internally used to show appropriate assertion message.

Examples

>>> from arkouda import testing as tm
>>> a = ak.Index([1, 2, 3])
>>> b = ak.Index([1, 2, 3])
>>> tm.assert_index_equal(a, b)
arkouda.assert_index_equivalent(left: arkouda.Index | pandas.Index, right: arkouda.Index | pandas.Index, exact: bool = True, check_names: bool = True, check_exact: bool = True, check_categorical: bool = True, check_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Index') None[source]

Check that left and right Index are equal.

If the objects are pandas.Index, they are converted to arkouda.Index. Then assert_almost_equal is applied to the result.

Parameters:
  • left (Index or pandas.Index)

  • right (Index or pandas.Index)

  • exact (True) – Whether to check the Index class, dtype and inferred_type are identical.

  • check_names (bool, default True) – Whether to check the names attribute.

  • check_exact (bool, default True) – Whether to compare number exactly.

  • check_categorical (bool, default True) – Whether to compare internal Categorical exactly.

  • check_order (bool, default True) – Whether to compare the order of index entries as well as their values. If True, both indexes must contain the same elements, in the same order. If False, both indexes must contain the same elements, but in any order.

  • rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.

  • atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.

  • obj (str, default 'Index') – Specify object name being compared, internally used to show appropriate assertion message.

Examples

>>> from arkouda import testing as tm
>>> import pandas as pd
>>> a = ak.Index([1, 2, 3])
>>> b = pd.Index([1, 2, 3])
>>> tm.assert_index_equivalent(a, b)
arkouda.assert_is_sorted(seq) None[source]

Assert that the sequence is sorted.

arkouda.assert_series_equal(left, right, check_dtype: bool = True, check_index_type: bool = True, check_series_type: bool = True, check_names: bool = True, check_exact: bool = False, check_categorical: bool = True, check_category_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Series', *, check_index: bool = True, check_like: bool = False) None[source]

Check that left and right Series are equal.

Parameters:
  • left (Series)

  • right (Series)

  • check_dtype (bool, default True) – Whether to check the Series dtype is identical.

  • check_index_type (bool, default True) – Whether to check the Index class, dtype and inferred_type are identical.

  • check_series_type (bool, default True) – Whether to check the Series class is identical.

  • check_names (bool, default True) – Whether to check the Series and Index names attribute.

  • check_exact (bool, default False) – Whether to compare number exactly.

  • check_categorical (bool, default True) – Whether to compare internal Categorical exactly.

  • check_category_order (bool, default True) – Whether to compare category order of internal Categoricals.

  • rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.

  • atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.

  • obj (str, default 'Series') – Specify object name being compared, internally used to show appropriate assertion message.

  • check_index (bool, default True) – Whether to check index equivalence. If False, then compare only values.

  • check_like (bool, default False) – If True, ignore the order of the index. Must be False if check_index is False. Note: same labels must be with the same data.

Examples

>>> from arkouda import testing as tm
>>> a = ak.Series([1, 2, 3, 4])
>>> b = ak.Series([1, 2, 3, 4])
>>> tm.assert_series_equal(a, b)
arkouda.assert_series_equivalent(left: arkouda.Series | pandas.Series, right: arkouda.Series | pandas.Series, check_dtype: bool = True, check_index_type: bool = True, check_series_type: bool = True, check_names: bool = True, check_exact: bool = False, check_categorical: bool = True, check_category_order: bool = True, rtol: float = 1e-05, atol: float = 1e-08, obj: str = 'Series', *, check_index: bool = True, check_like: bool = False) None[source]

Check that left and right Series are equal.

pd.Series’s will be converted to the arkouda equivalent. Then assert_series_equal will be applied to the result.

Parameters:
  • left (Series or pd.Series)

  • right (Series or pd.Series)

  • check_dtype (bool, default True) – Whether to check the Series dtype is identical.

  • check_index_type (bool, default True) – Whether to check the Index class, dtype and inferred_type are identical.

  • check_series_type (bool, default True) – Whether to check the Series class is identical.

  • check_names (bool, default True) – Whether to check the Series and Index names attribute.

  • check_exact (bool, default False) – Whether to compare number exactly.

  • check_categorical (bool, default True) – Whether to compare internal Categorical exactly.

  • check_category_order (bool, default True) – Whether to compare category order of internal Categoricals.

  • rtol (float, default 1e-5) – Relative tolerance. Only used when check_exact is False.

  • atol (float, default 1e-8) – Absolute tolerance. Only used when check_exact is False.

  • obj (str, default 'Series') – Specify object name being compared, internally used to show appropriate assertion message.

  • check_index (bool, default True) – Whether to check index equivalence. If False, then compare only values.

  • check_like (bool, default False) – If True, ignore the order of the index. Must be False if check_index is False. Note: same labels must be with the same data.

Examples

>>> from arkouda import testing as tm
>>> import pandas as pd
>>> a = ak.Series([1, 2, 3, 4])
>>> b = pd.Series([1, 2, 3, 4])
>>> tm.assert_series_equivalent(a, b)
arkouda.attach(name: str)[source]
arkouda.attach_all(names: list)[source]

Attach to all objects registered with the names provide

Parameters:

names (list) – List of names to attach to

Return type:

dict

arkouda.attach_pdarray(user_defined_name: str) pdarray[source]

class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which array was registered under

Returns:

pdarray which is bound to the corresponding server side component which was registered with user_defined_name

Return type:

pdarray

Raises:

TypeError – Raised if user_defined_name is not a str

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.attach_pdarray("my_zeros")
>>> # ...other work...
>>> b.unregister()
arkouda.base_repr(number, base=2, padding=0)

Return a string representation of a number in the given base system.

Parameters:
  • number (int) – The value to convert. Positive and negative values are handled.

  • base (int, optional) – Convert number to the base number system. The valid range is 2-36, the default value is 2.

  • padding (int, optional) – Number of zeros padded on the left. Default is 0 (no padding).

Returns:

out – String representation of number in base system.

Return type:

str

See also

binary_repr

Faster version of base_repr for base 2.

Examples

>>> np.base_repr(5)
'101'
>>> np.base_repr(6, 5)
'11'
>>> np.base_repr(7, base=5, padding=3)
'00012'
>>> np.base_repr(10, base=16)
'A'
>>> np.base_repr(32, base=16)
'20'
class arkouda.bigint[source]
itemsize(*args, **kwargs)

int([x]) -> integer int(x, base=10) -> integer

Convert a number or string to an integer, or return 0 if no arguments are given. If x is a number, return x.__int__(). For floating point numbers, this truncates towards zero.

If x is not a number or if base is given, then x must be a string, bytes, or bytearray instance representing an integer literal in the given base. The literal can be preceded by ‘+’ or ‘-’ and be surrounded by whitespace. The base defaults to 10. Valid bases are 0 and 2-36. Base 0 means to interpret the base from the string as an integer literal. >>> int(‘0b100’, base=0) 4

name(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

ndim(*args, **kwargs)

int([x]) -> integer int(x, base=10) -> integer

Convert a number or string to an integer, or return 0 if no arguments are given. If x is a number, return x.__int__(). For floating point numbers, this truncates towards zero.

If x is not a number or if base is given, then x must be a string, bytes, or bytearray instance representing an integer literal in the given base. The literal can be preceded by ‘+’ or ‘-’ and be surrounded by whitespace. The base defaults to 10. Valid bases are 0 and 2-36. Base 0 means to interpret the base from the string as an integer literal. >>> int(‘0b100’, base=0) 4

shape(*args, **kwargs)

Built-in immutable sequence.

If no argument is given, the constructor returns an empty tuple. If iterable is specified the tuple is initialized from iterable’s items.

If the argument is a tuple, the return value is the same object.

type(x)[source]
class arkouda.bigint[source]
itemsize(*args, **kwargs)

int([x]) -> integer int(x, base=10) -> integer

Convert a number or string to an integer, or return 0 if no arguments are given. If x is a number, return x.__int__(). For floating point numbers, this truncates towards zero.

If x is not a number or if base is given, then x must be a string, bytes, or bytearray instance representing an integer literal in the given base. The literal can be preceded by ‘+’ or ‘-’ and be surrounded by whitespace. The base defaults to 10. Valid bases are 0 and 2-36. Base 0 means to interpret the base from the string as an integer literal. >>> int(‘0b100’, base=0) 4

name(*args, **kwargs)

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

ndim(*args, **kwargs)

int([x]) -> integer int(x, base=10) -> integer

Convert a number or string to an integer, or return 0 if no arguments are given. If x is a number, return x.__int__(). For floating point numbers, this truncates towards zero.

If x is not a number or if base is given, then x must be a string, bytes, or bytearray instance representing an integer literal in the given base. The literal can be preceded by ‘+’ or ‘-’ and be surrounded by whitespace. The base defaults to 10. Valid bases are 0 and 2-36. Base 0 means to interpret the base from the string as an integer literal. >>> int(‘0b100’, base=0) 4

shape(*args, **kwargs)

Built-in immutable sequence.

If no argument is given, the constructor returns an empty tuple. If iterable is specified the tuple is initialized from iterable’s items.

If the argument is a tuple, the return value is the same object.

type(x)[source]
arkouda.binary_repr(num, width=None)

Return the binary representation of the input number as a string.

For negative numbers, if width is not given, a minus sign is added to the front. If width is given, the two’s complement of the number is returned, with respect to that width.

In a two’s-complement system negative numbers are represented by the two’s complement of the absolute value. This is the most common method of representing signed integers on computers [1]_. A N-bit two’s-complement system can represent every integer in the range \(-2^{N-1}\) to \(+2^{N-1}-1\).

Parameters:
  • num (int) – Only an integer decimal number can be used.

  • width (int, optional) –

    The length of the returned string if num is positive, or the length of the two’s complement if num is negative, provided that width is at least a sufficient number of bits for num to be represented in the designated form.

    If the width value is insufficient, it will be ignored, and num will be returned in binary (num > 0) or two’s complement (num < 0) form with its width equal to the minimum number of bits needed to represent the number in the designated form. This behavior is deprecated and will later raise an error.

    Deprecated since version 1.12.0.

Returns:

bin – Binary representation of num or two’s complement of num.

Return type:

str

See also

base_repr

Return a string representation of a number in the given base system.

bin

Python’s built-in binary representation generator of an integer.

Notes

binary_repr is equivalent to using base_repr with base 2, but about 25x faster.

References

Examples

>>> np.binary_repr(3)
'11'
>>> np.binary_repr(-3)
'-11'
>>> np.binary_repr(3, width=4)
'0011'

The two’s complement is returned when the input number is negative and width is specified:

>>> np.binary_repr(-3, width=3)
'101'
>>> np.binary_repr(-3, width=5)
'11101'
class arkouda.bitType(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned long.

Character code:

'L'

Canonical name:

numpy.uint

Alias on this platform (Linux x86_64):

numpy.uint64: 64-bit unsigned integer (0 to 18_446_744_073_709_551_615).

Alias on this platform (Linux x86_64):

numpy.uintp: Unsigned integer large enough to fit pointer, compatible with C uintptr_t.

bit_count(*args, **kwargs)

uint64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint64(127).bit_count()
7
class arkouda.bitType(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned long.

Character code:

'L'

Canonical name:

numpy.uint

Alias on this platform (Linux x86_64):

numpy.uint64: 64-bit unsigned integer (0 to 18_446_744_073_709_551_615).

Alias on this platform (Linux x86_64):

numpy.uintp: Unsigned integer large enough to fit pointer, compatible with C uintptr_t.

bit_count(*args, **kwargs)

uint64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint64(127).bit_count()
7
class arkouda.bool_(value)

Bases: numpy.generic

Boolean type (True or False), stored as a byte.

Warning

The bool_ type is not a subclass of the int_ type (the bool_ is not even a number type). This is different than Python’s default implementation of bool as a sub-class of int.

Character code:

'?'

class arkouda.bool_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

class arkouda.bool_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

arkouda.broadcast(segments: pdarray, values: pdarray | Strings, size: int | np.int64 | np.uint64 = -1, permutation: pdarray | None = None)[source]

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

Parameters:
  • segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.

  • values (pdarray, Strings) – The values to broadcast, one per row (or group)

  • size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.

  • permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.

Returns:

The broadcast values, one per nonzero

Return type:

pdarray, Strings

Raises:

ValueError

  • If segments and values are different sizes

  • If segments are empty

  • If number of nonzeros (either user-specified or inferred from permutation) is less than one

Examples

>>>
# Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
# Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
# If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
arkouda.broadcast(segments: pdarray, values: pdarray | Strings, size: int | np.int64 | np.uint64 = -1, permutation: pdarray | None = None)[source]

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

Parameters:
  • segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.

  • values (pdarray, Strings) – The values to broadcast, one per row (or group)

  • size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.

  • permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.

Returns:

The broadcast values, one per nonzero

Return type:

pdarray, Strings

Raises:

ValueError

  • If segments and values are different sizes

  • If segments are empty

  • If number of nonzeros (either user-specified or inferred from permutation) is less than one

Examples

>>>
# Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
# Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
# If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
arkouda.broadcast(segments: pdarray, values: pdarray | Strings, size: int | np.int64 | np.uint64 = -1, permutation: pdarray | None = None)[source]

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

Parameters:
  • segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.

  • values (pdarray, Strings) – The values to broadcast, one per row (or group)

  • size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.

  • permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.

Returns:

The broadcast values, one per nonzero

Return type:

pdarray, Strings

Raises:

ValueError

  • If segments and values are different sizes

  • If segments are empty

  • If number of nonzeros (either user-specified or inferred from permutation) is less than one

Examples

>>>
# Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
# Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
# If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
arkouda.broadcast(segments: pdarray, values: pdarray | Strings, size: int | np.int64 | np.uint64 = -1, permutation: pdarray | None = None)[source]

Broadcast a dense column vector to the rows of a sparse matrix or grouped array.

Parameters:
  • segments (pdarray, int64) – Offsets of the start of each row in the sparse matrix or grouped array. Must be sorted in ascending order.

  • values (pdarray, Strings) – The values to broadcast, one per row (or group)

  • size (int) – The total number of nonzeros in the matrix. If permutation is given, this argument is ignored and the size is inferred from the permutation array.

  • permutation (pdarray, int64) – The permutation to go from the original ordering of nonzeros to the ordering grouped by row. To broadcast values back to the original ordering, this permutation will be inverted. If no permutation is supplied, it is assumed that the original nonzeros were already grouped by row. In this case, the size argument must be given.

Returns:

The broadcast values, one per nonzero

Return type:

pdarray, Strings

Raises:

ValueError

  • If segments and values are different sizes

  • If segments are empty

  • If number of nonzeros (either user-specified or inferred from permutation) is less than one

Examples

>>>
# Define a sparse matrix with 3 rows and 7 nonzeros
>>> row_starts = ak.array([0, 2, 5])
>>> nnz = 7
# Broadcast the row number to each nonzero element
>>> row_number = ak.arange(3)
>>> ak.broadcast(row_starts, row_number, nnz)
array([0 0 1 1 1 2 2])
# If the original nonzeros were in reverse order...
>>> permutation = ak.arange(6, -1, -1)
>>> ak.broadcast(row_starts, row_number, permutation=permutation)
array([2 2 1 1 1 0 0])
arkouda.broadcast_dims(sa: Sequence[int], sb: Sequence[int]) Tuple[int, Ellipsis][source]

Algorithm to determine shape of broadcasted PD array given two array shapes

see: https://data-apis.org/array-api/latest/API_specification/broadcasting.html#algorithm

arkouda.broadcast_to_shape(pda: pdarray, shape: Tuple[int, Ellipsis]) pdarray[source]

expand an array’s rank to the specified shape using broadcasting

class arkouda.byte(value)

Bases: numpy.signedinteger

Signed integer type, compatible with C char.

Character code:

'b'

Canonical name:

numpy.byte

Alias on this platform (Linux x86_64):

numpy.int8: 8-bit signed integer (-128 to 127).

bit_count(*args, **kwargs)

int8.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int8(127).bit_count()
7
>>> np.int8(-127).bit_count()
7
class arkouda.bytes_

A byte string.

When used in arrays, this type strips trailing null bytes.

Character code:

'S'

Alias:

numpy.string_

T(*args, **kwargs)

Scalar attribute identical to the corresponding array attribute.

Please see ndarray.T.

all(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.all.

any(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.any.

argmax(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argmax.

argmin(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argmin.

argsort(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argsort.

astype(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.astype.

base(*args, **kwargs)

Scalar attribute identical to the corresponding array attribute.

Please see ndarray.base.

byteswap(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.byteswap.

choose(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.choose.

clip(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.clip.

compress(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.compress.

conj(*args, **kwargs)
conjugate(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.conjugate.

copy(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.copy.

cumprod(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.cumprod.

cumsum(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.cumsum.

data(*args, **kwargs)

Pointer to start of data.

diagonal(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.diagonal.

dtype(*args, **kwargs)

Get array data-descriptor.

dump(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.dump.

dumps(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.dumps.

fill(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.fill.

flags(*args, **kwargs)

The integer value of flags.

flat(*args, **kwargs)

A 1-D view of the scalar.

flatten(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.flatten.

getfield(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.getfield.

imag(*args, **kwargs)

The imaginary part of the scalar.

item(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.item.

itemset(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.itemset.

itemsize(*args, **kwargs)

The length of one element in bytes.

max(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.max.

mean(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.mean.

min(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.min.

nbytes(*args, **kwargs)

The length of the scalar in bytes.

ndim(*args, **kwargs)

The number of array dimensions.

newbyteorder(*args, **kwargs)

newbyteorder(new_order=’S’, /)

Return a new dtype with a different byte order.

Changes are also made in all fields and sub-arrays of the data type.

The new_order code can be any from the following:

  • ‘S’ - swap dtype from current to opposite endian

  • {‘<’, ‘little’} - little endian

  • {‘>’, ‘big’} - big endian

  • {‘=’, ‘native’} - native order

  • {‘|’, ‘I’} - ignore (no change to byte order)

new_orderstr, optional

Byte order to force; a value from the byte order specifications above. The default value (‘S’) results in swapping the current byte order.

new_dtypedtype

New dtype object with the given change to the byte order.

nonzero(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.nonzero.

prod(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.prod.

ptp(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.ptp.

put(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.put.

ravel(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.ravel.

real(*args, **kwargs)

The real part of the scalar.

repeat(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.repeat.

reshape(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.reshape.

resize(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.resize.

round(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.round.

searchsorted(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.searchsorted.

setfield(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.setfield.

setflags(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.setflags.

shape(*args, **kwargs)

Tuple of array dimensions.

size(*args, **kwargs)

The number of elements in the gentype.

sort(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.sort.

squeeze(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.squeeze.

std(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.std.

strides(*args, **kwargs)

Tuple of bytes steps in each dimension.

sum(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.sum.

swapaxes(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.swapaxes.

take(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.take.

tobytes(*args, **kwargs)
tofile(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tofile.

tolist(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tolist.

tostring(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tostring.

trace(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.trace.

transpose(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.transpose.

var(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.var.

view(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.view.

arkouda.cast(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical, dt: numpy.dtype | type | str | dtypes.dtypes.bigint, errors: _numeric.ErrorMode = ErrorMode.strict) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical | Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Cast an array to another dtype.

Parameters:
  • pda (pdarray or Strings) – The array of values to cast

  • dt (np.dtype, type, or str) – The target dtype to cast values to

  • errors ({strict, ignore, return_validity}) –

    Controls how errors are handled when casting strings to a numeric type (ignored for casts from numeric types).

    • strict: raise RuntimeError if any string cannot be converted

    • ignore: never raise an error. Uninterpretable strings get

      converted to NaN (float64), -2**63 (int64), zero (uint64 and uint8), or False (bool)

    • return_validity: in addition to returning the same output as “ignore”, also return a bool array indicating where the cast was successful.

Returns:

  • pdarray or Strings – Array of values cast to desired dtype

  • [validity (pdarray(bool)]) – If errors=”return_validity” and input is Strings, a second array is returned with True where the cast succeeded and False where it failed.

Notes

The cast is performed according to Chapel’s casting rules and is NOT safe from overflows or underflows. The user must ensure that the target dtype has the precision and capacity to hold the desired result.

Examples

>>> ak.cast(ak.linspace(1.0,5.0,5), dt=ak.int64)
array([1, 2, 3, 4, 5])
>>> ak.cast(ak.arange(0,5), dt=ak.float64).dtype
dtype('float64')
>>> ak.cast(ak.arange(0,5), dt=ak.bool_)
array([False, True, True, True, True])
>>> ak.cast(ak.linspace(0,4,5), dt=ak.bool_)
array([False, True, True, True, True])
arkouda.cast(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical, dt: numpy.dtype | type | str | dtypes.dtypes.bigint, errors: _numeric.ErrorMode = ErrorMode.strict) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical | Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Cast an array to another dtype.

Parameters:
  • pda (pdarray or Strings) – The array of values to cast

  • dt (np.dtype, type, or str) – The target dtype to cast values to

  • errors ({strict, ignore, return_validity}) –

    Controls how errors are handled when casting strings to a numeric type (ignored for casts from numeric types).

    • strict: raise RuntimeError if any string cannot be converted

    • ignore: never raise an error. Uninterpretable strings get

      converted to NaN (float64), -2**63 (int64), zero (uint64 and uint8), or False (bool)

    • return_validity: in addition to returning the same output as “ignore”, also return a bool array indicating where the cast was successful.

Returns:

  • pdarray or Strings – Array of values cast to desired dtype

  • [validity (pdarray(bool)]) – If errors=”return_validity” and input is Strings, a second array is returned with True where the cast succeeded and False where it failed.

Notes

The cast is performed according to Chapel’s casting rules and is NOT safe from overflows or underflows. The user must ensure that the target dtype has the precision and capacity to hold the desired result.

Examples

>>> ak.cast(ak.linspace(1.0,5.0,5), dt=ak.int64)
array([1, 2, 3, 4, 5])
>>> ak.cast(ak.arange(0,5), dt=ak.float64).dtype
dtype('float64')
>>> ak.cast(ak.arange(0,5), dt=ak.bool_)
array([False, True, True, True, True])
>>> ak.cast(ak.linspace(0,4,5), dt=ak.bool_)
array([False, True, True, True, True])
class arkouda.cdouble(value)

Bases: numpy.complexfloating

Complex number type composed of two double-precision floating-point

numbers, compatible with Python complex.

Character code:

'D'

Canonical name:

numpy.cdouble

Alias:

numpy.cfloat

Alias:

numpy.complex_

Alias on this platform (Linux x86_64):

numpy.complex128: Complex number type composed of 2 64-bit-precision floating-point numbers.

arkouda.ceil(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise ceiling of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing ceiling values of the input array elements

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.ceil(ak.linspace(1.1,5.5,5))
array([2, 3, 4, 5, 6])
class arkouda.cfloat(value)

Bases: numpy.complexfloating

Complex number type composed of two double-precision floating-point

numbers, compatible with Python complex.

Character code:

'D'

Canonical name:

numpy.cdouble

Alias:

numpy.cfloat

Alias:

numpy.complex_

Alias on this platform (Linux x86_64):

numpy.complex128: Complex number type composed of 2 64-bit-precision floating-point numbers.

class arkouda.character(value)

Bases: numpy.flexible

Abstract base class of all character string scalar types.

arkouda.chisquare(f_obs, f_exp=None, ddof=0)[source]

Computes the chi square statistic and p-value.

Parameters:
  • f_obs (pdarray) – The observed frequency.

  • f_exp (pdarray, default = None) – The expected frequency.

  • ddof (int) – The delta degrees of freedom.

Return type:

arkouda.akstats.Power_divergenceResult

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.stats import chisquare
>>> chisquare(ak.array([10, 20, 30, 10]), ak.array([10, 30, 20, 10]))
Power_divergenceResult(statistic=8.333333333333334, pvalue=0.03960235520756414)

See also

scipy.stats.chisquare, arkouda.akstats.power_divergence

References

[1] “Chi-squared test”, https://en.wikipedia.org/wiki/Chi-squared_test

[2] “scipy.stats.chisquare”, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html

arkouda.clear() None[source]

Send a clear message to clear all unregistered data from the server symbol table

Return type:

None

Raises:

RuntimeError – Raised if there is a server-side error in executing clear request

arkouda.clip(pda: arkouda.pdarrayclass.pdarray, lo: float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | arkouda.pdarrayclass.pdarray, hi: float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Clip (limit) the values in an array to a given range [lo,hi]

Given an array a, values outside the range are clipped to the range edges, such that all elements lie in the range.

There is no check to enforce that lo < hi. If lo > hi, the corresponding value of the array will be set to hi.

If lo or hi (or both) are pdarrays, the check is by pairwise elements. See examples.

Parameters:
  • pda (pdarray, int64 or float64) – the array of values to clip

  • lo (scalar or pdarray, int64 or float64) – the lower value of the clipping range

  • hi (scalar or pdarray, int64 or float64) – the higher value of the clipping range

  • pdarrays (If lo or hi (or both) are) – See examples.

  • elements. (the check is by pairwise) – See examples.

Returns:

A pdarray matching pda, except that element x remains x if lo <= x <= hi,

or becomes lo if x < lo, or becomes hi if x > hi.

Return type:

arkouda.pdarrayclass.pdarray

Examples

>>> a = ak.array([1,2,3,4,5,6,7,8,9,10])
>>> ak.clip(a,3,8)
array([3,3,3,4,5,6,7,8,8,8])
>>> ak.clip(a,3,8.0)
array([3.00000000000000000 3.00000000000000000 3.00000000000000000 4.00000000000000000
       5.00000000000000000 6.00000000000000000 7.00000000000000000 8.00000000000000000
       8.00000000000000000 8.00000000000000000])
>>> ak.clip(a,None,7)
array([1,2,3,4,5,6,7,7,7,7])
>>> ak.clip(a,5,None)
array([5,5,5,5,5,6,7,8,9,10])
>>> ak.clip(a,None,None)
ValueError : either min or max must be supplied
>>> ak.clip(a,ak.array([2,2,3,3,8,8,5,5,6,6],8))
array([2,2,3,4,8,8,7,8,8,8])
>>> ak.clip(a,4,ak.array([10,9,8,7,6,5,5,5,5,5]))
array([4,4,4,4,5,5,5,5,5,5])

Notes

Either lo or hi may be None, but not both. If lo > hi, all x = hi. If all inputs are int64, output is int64, but if any input is float64, output is float64.

Raises:

ValueError – Raised if both lo and hi are None

class arkouda.clongdouble(value)

Bases: numpy.complexfloating

Complex number type composed of two extended-precision floating-point

numbers.

Character code:

'G'

Alias:

numpy.clongfloat

Alias:

numpy.longcomplex

Alias on this platform (Linux x86_64):

numpy.complex256: Complex number type composed of 2 128-bit extended-precision floating-point numbers.

class arkouda.clongfloat(value)

Bases: numpy.complexfloating

Complex number type composed of two extended-precision floating-point

numbers.

Character code:

'G'

Alias:

numpy.clongfloat

Alias:

numpy.longcomplex

Alias on this platform (Linux x86_64):

numpy.complex256: Complex number type composed of 2 128-bit extended-precision floating-point numbers.

arkouda.clz(pda: pdarray) pdarray[source]

Count leading zeros for each integer in an array.

Parameters:

pda (pdarray, int64, uint64, bigint) – Input array (must be integral).

Returns:

lz – The number of leading zeros of each element.

Return type:

pdarray

Raises:

TypeError – If input array is not int64, uint64, or bigint

Examples

>>> A = ak.arange(10)
>>> ak.clz(A)
array([64, 63, 62, 62, 61, 61, 61, 61, 60, 60])
arkouda.coargsort(arrays: Sequence[arkouda.strings.Strings | arkouda.pdarrayclass.pdarray | arkouda.categorical.Categorical], algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD) arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the rows (left-to-right), if the input arrays are treated as columns. The permutation sorts numeric columns, but not strings/Categoricals – strings/Categoricals are grouped, but not ordered.

Parameters:

arrays (Sequence[Union[Strings, pdarray, Categorical]]) – The columns (int64, uint64, float64, Strings, or Categorical) to sort by row

Returns:

The indices that permute the rows to grouped order

Return type:

pdarray, int64

Raises:

ValueError – Raised if the pdarrays are not of the same size or if the parameter is not an Iterable containing pdarrays, Strings, or Categoricals

See also

argsort

Notes

Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive. Starts with the last array and moves forward. This sort operates directly on numeric types, but for Strings, it operates on a hash. Thus, while grouping of equivalent strings is guaranteed, lexicographic ordering of the groups is not. For Categoricals, coargsort sorts based on Categorical.codes which guarantees grouping of equivalent categories but not lexicographic ordering of those groups.

Examples

>>> a = ak.array([0, 1, 0, 1])
>>> b = ak.array([1, 1, 0, 0])
>>> perm = ak.coargsort([a, b])
>>> perm
array([2, 0, 3, 1])
>>> a[perm]
array([0, 0, 1, 1])
>>> b[perm]
array([0, 1, 0, 1])
arkouda.coargsort(arrays: Sequence[arkouda.strings.Strings | arkouda.pdarrayclass.pdarray | arkouda.categorical.Categorical], algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD) arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the rows (left-to-right), if the input arrays are treated as columns. The permutation sorts numeric columns, but not strings/Categoricals – strings/Categoricals are grouped, but not ordered.

Parameters:

arrays (Sequence[Union[Strings, pdarray, Categorical]]) – The columns (int64, uint64, float64, Strings, or Categorical) to sort by row

Returns:

The indices that permute the rows to grouped order

Return type:

pdarray, int64

Raises:

ValueError – Raised if the pdarrays are not of the same size or if the parameter is not an Iterable containing pdarrays, Strings, or Categoricals

See also

argsort

Notes

Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive. Starts with the last array and moves forward. This sort operates directly on numeric types, but for Strings, it operates on a hash. Thus, while grouping of equivalent strings is guaranteed, lexicographic ordering of the groups is not. For Categoricals, coargsort sorts based on Categorical.codes which guarantees grouping of equivalent categories but not lexicographic ordering of those groups.

Examples

>>> a = ak.array([0, 1, 0, 1])
>>> b = ak.array([1, 1, 0, 0])
>>> perm = ak.coargsort([a, b])
>>> perm
array([2, 0, 3, 1])
>>> a[perm]
array([0, 0, 1, 1])
>>> b[perm]
array([0, 1, 0, 1])
arkouda.coargsort(arrays: Sequence[arkouda.strings.Strings | arkouda.pdarrayclass.pdarray | arkouda.categorical.Categorical], algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD) arkouda.pdarrayclass.pdarray[source]

Return the permutation that groups the rows (left-to-right), if the input arrays are treated as columns. The permutation sorts numeric columns, but not strings/Categoricals – strings/Categoricals are grouped, but not ordered.

Parameters:

arrays (Sequence[Union[Strings, pdarray, Categorical]]) – The columns (int64, uint64, float64, Strings, or Categorical) to sort by row

Returns:

The indices that permute the rows to grouped order

Return type:

pdarray, int64

Raises:

ValueError – Raised if the pdarrays are not of the same size or if the parameter is not an Iterable containing pdarrays, Strings, or Categoricals

See also

argsort

Notes

Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive. Starts with the last array and moves forward. This sort operates directly on numeric types, but for Strings, it operates on a hash. Thus, while grouping of equivalent strings is guaranteed, lexicographic ordering of the groups is not. For Categoricals, coargsort sorts based on Categorical.codes which guarantees grouping of equivalent categories but not lexicographic ordering of those groups.

Examples

>>> a = ak.array([0, 1, 0, 1])
>>> b = ak.array([1, 1, 0, 0])
>>> perm = ak.coargsort([a, b])
>>> perm
array([2, 0, 3, 1])
>>> a[perm]
array([0, 0, 1, 1])
>>> b[perm]
array([0, 1, 0, 1])
class arkouda.complex128(value)

Bases: numpy.complexfloating

Complex number type composed of two double-precision floating-point

numbers, compatible with Python complex.

Character code:

'D'

Canonical name:

numpy.cdouble

Alias:

numpy.cfloat

Alias:

numpy.complex_

Alias on this platform (Linux x86_64):

numpy.complex128: Complex number type composed of 2 64-bit-precision floating-point numbers.

class arkouda.complex64(value)

Bases: numpy.complexfloating

Complex number type composed of two single-precision floating-point

numbers.

Character code:

'F'

Canonical name:

numpy.csingle

Alias:

numpy.singlecomplex

Alias on this platform (Linux x86_64):

numpy.complex64: Complex number type composed of 2 32-bit-precision floating-point numbers.

arkouda.compute_join_size(a: arkouda.pdarrayclass.pdarray, b: arkouda.pdarrayclass.pdarray) Tuple[int, int][source]

Compute the internal size of a hypothetical join between a and b. Returns both the number of elements and number of bytes required for the join.

arkouda.concatenate(arrays: Sequence[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical], ordered: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical | Sequence[arkouda.categorical.Categorical][source]

Concatenate a list or tuple of pdarray or Strings objects into one pdarray or Strings object, respectively.

Parameters:
  • arrays (Sequence[Union[pdarray,Strings,Categorical]]) – The arrays to concatenate. Must all have same dtype.

  • ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.

Returns:

Single pdarray or Strings object containing all values, returned in the original order

Return type:

Union[pdarray,Strings,Categorical]

Raises:
  • ValueError – Raised if arrays is empty or if 1..n pdarrays have differing dtypes

  • TypeError – Raised if arrays is not a pdarrays or Strings python Sequence such as a list or tuple

  • RuntimeError – Raised if 1..n array elements are dtypes for which concatenate has not been implemented.

Examples

>>> ak.concatenate([ak.array([1, 2, 3]), ak.array([4, 5, 6])])
array([1, 2, 3, 4, 5, 6])
>>> ak.concatenate([ak.array([True,False,True]),ak.array([False,True,True])])
array([True, False, True, False, True, True])
>>> ak.concatenate([ak.array(['one','two']),ak.array(['three','four','five'])])
array(['one', 'two', 'three', 'four', 'five'])
arkouda.concatenate(arrays: Sequence[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical], ordered: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical | Sequence[arkouda.categorical.Categorical][source]

Concatenate a list or tuple of pdarray or Strings objects into one pdarray or Strings object, respectively.

Parameters:
  • arrays (Sequence[Union[pdarray,Strings,Categorical]]) – The arrays to concatenate. Must all have same dtype.

  • ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.

Returns:

Single pdarray or Strings object containing all values, returned in the original order

Return type:

Union[pdarray,Strings,Categorical]

Raises:
  • ValueError – Raised if arrays is empty or if 1..n pdarrays have differing dtypes

  • TypeError – Raised if arrays is not a pdarrays or Strings python Sequence such as a list or tuple

  • RuntimeError – Raised if 1..n array elements are dtypes for which concatenate has not been implemented.

Examples

>>> ak.concatenate([ak.array([1, 2, 3]), ak.array([4, 5, 6])])
array([1, 2, 3, 4, 5, 6])
>>> ak.concatenate([ak.array([True,False,True]),ak.array([False,True,True])])
array([True, False, True, False, True, True])
>>> ak.concatenate([ak.array(['one','two']),ak.array(['three','four','five'])])
array(['one', 'two', 'three', 'four', 'five'])
arkouda.concatenate(arrays: Sequence[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical], ordered: bool = True) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical | Sequence[arkouda.categorical.Categorical][source]

Concatenate a list or tuple of pdarray or Strings objects into one pdarray or Strings object, respectively.

Parameters:
  • arrays (Sequence[Union[pdarray,Strings,Categorical]]) – The arrays to concatenate. Must all have same dtype.

  • ordered (bool) – If True (default), the arrays will be appended in the order given. If False, array data may be interleaved in blocks, which can greatly improve performance but results in non-deterministic ordering of elements.

Returns:

Single pdarray or Strings object containing all values, returned in the original order

Return type:

Union[pdarray,Strings,Categorical]

Raises:
  • ValueError – Raised if arrays is empty or if 1..n pdarrays have differing dtypes

  • TypeError – Raised if arrays is not a pdarrays or Strings python Sequence such as a list or tuple

  • RuntimeError – Raised if 1..n array elements are dtypes for which concatenate has not been implemented.

Examples

>>> ak.concatenate([ak.array([1, 2, 3]), ak.array([4, 5, 6])])
array([1, 2, 3, 4, 5, 6])
>>> ak.concatenate([ak.array([True,False,True]),ak.array([False,True,True])])
array([True, False, True, False, True, True])
>>> ak.concatenate([ak.array(['one','two']),ak.array(['three','four','five'])])
array(['one', 'two', 'three', 'four', 'five'])
arkouda.convert_if_categorical(values)[source]

Convert a Categorical array to Strings for display

arkouda.corr(x: pdarray, y: pdarray) numpy.float64[source]

Return the correlation between x and y

Parameters:
  • x (pdarray) – One of the pdarrays used to calculate correlation

  • y (pdarray) – One of the pdarrays used to calculate correlation

Returns:

The scalar correlation of the two pdarrays

Return type:

np.float64

Raises:
  • TypeError – Raised if x or y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

See also

std, cov

Notes

The correlation is calculated by cov(x, y) / (x.std(ddof=1) * y.std(ddof=1))

arkouda.cos(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise cosine of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the cosine will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing cosine for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.cosh(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise hyperbolic cosine of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the hyperbolic cosine will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing hyperbolic cosine for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.count_nonzero(pda)[source]

Compute the nonzero count of a given array. 1D case only, for now.

Parameters:

pda (pdarray) – The input data, in pdarray form, numeric, bool, or str

Returns:

The nonzero count of the entire pdarray

Return type:

np.int64

Examples

>>> pda = ak.array([0,4,7,8,1,3,5,2,-1])
>>> ak.count_nonzero(pda)
9
>>> pda = ak.array([False,True,False,True,False])
>>> ak.count_nonzero(pda)
3
>>> pda = ak.array(["hello","","there"])
>>> ak.count_nonzero(pda)
2
arkouda.cov(x: pdarray, y: pdarray) numpy.float64[source]

Return the covariance of x and y

Parameters:
  • x (pdarray) – One of the pdarrays used to calculate covariance

  • y (pdarray) – One of the pdarrays used to calculate covariance

Returns:

The scalar covariance of the two pdarrays

Return type:

np.float64

Raises:
  • TypeError – Raised if x or y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

See also

mean, var

Notes

The covariance is calculated by cov = ((x - x.mean()) * (y - y.mean())).sum() / (x.size - 1).

arkouda.create_pdarray(repMsg: str, max_bits=None) pdarray[source]

Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.

Parameters:

repMsg (str) – space-delimited string containing the pdarray name, datatype, size dimension, shape,and itemsize

Returns:

A pdarray with the same attributes and data as the pdarray; on GPU

Return type:

pdarray

Raises:
  • ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance

  • RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance

arkouda.create_pdarray(repMsg: str, max_bits=None) pdarray[source]

Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.

Parameters:

repMsg (str) – space-delimited string containing the pdarray name, datatype, size dimension, shape,and itemsize

Returns:

A pdarray with the same attributes and data as the pdarray; on GPU

Return type:

pdarray

Raises:
  • ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance

  • RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance

arkouda.create_pdarray(repMsg: str, max_bits=None) pdarray[source]

Return a pdarray instance pointing to an array created by the arkouda server. The user should not call this function directly.

Parameters:

repMsg (str) – space-delimited string containing the pdarray name, datatype, size dimension, shape,and itemsize

Returns:

A pdarray with the same attributes and data as the pdarray; on GPU

Return type:

pdarray

Raises:
  • ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance

  • RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance

arkouda.create_sparray(repMsg: str, max_bits=None) sparray[source]

Return a sparray instance pointing to an array created by the arkouda server. The user should not call this function directly.

Parameters:

repMsg (str) – space-delimited string containing the sparray name, datatype, size dimension, shape,and itemsize

Returns:

A sparray with the same attributes as on the server

Return type:

sparray

Raises:
  • ValueError – If there’s an error in parsing the repMsg parameter into the six values needed to create the pdarray instance

  • RuntimeError – Raised if a server-side error is thrown in the process of creating the pdarray instance

arkouda.create_sparse_matrix(size: int, rows: arkouda.pdarrayclass.pdarray, cols: arkouda.pdarrayclass.pdarray, vals: arkouda.pdarrayclass.pdarray, layout: str) arkouda.sparrayclass.sparray[source]

Create a sparse matrix from three pdarrays representing the row indices, column indices, and values of the non-zero elements of the matrix.

Parameters:
  • rows (pdarray) – The row indices of the non-zero elements

  • cols (pdarray) – The column indices of the non-zero elements

  • vals (pdarray) – The values of the non-zero elements

Returns:

A sparse matrix with the specified row and column indices and values

Return type:

sparray

class arkouda.csingle(value)

Bases: numpy.complexfloating

Complex number type composed of two single-precision floating-point

numbers.

Character code:

'F'

Canonical name:

numpy.csingle

Alias:

numpy.singlecomplex

Alias on this platform (Linux x86_64):

numpy.complex64: Complex number type composed of 2 32-bit-precision floating-point numbers.

arkouda.ctz(pda: pdarray) pdarray[source]

Count trailing zeros for each integer in an array.

Parameters:

pda (pdarray, int64, uint64, bigint) – Input array (must be integral).

Returns:

lz – The number of trailing zeros of each element.

Return type:

pdarray

Notes

ctz(0) is defined to be zero.

Raises:

TypeError – If input array is not int64, uint64, or bigint

Examples

>>> A = ak.arange(10)
>>> ak.ctz(A)
array([0, 0, 1, 0, 2, 0, 1, 0, 3, 0])
arkouda.cumprod(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the cumulative product over the array.

The product is inclusive, such that the i th element of the result is the product of elements up to and including i.

Parameters:

pda (pdarray)

Returns:

A pdarray containing cumulative products for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.cumprod(ak.arange(1,5))
array([1, 2, 6, 24]))
>>> ak.cumprod(ak.uniform(5,1.0,5.0))
array([1.5728783400481925, 7.0472855509390593, 33.78523998586553,
       134.05309592737584, 450.21589865655358])
arkouda.cumsum(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the cumulative sum over the array.

The sum is inclusive, such that the i th element of the result is the sum of elements up to and including i.

Parameters:

pda (pdarray)

Returns:

A pdarray containing cumulative sums for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.cumsum(ak.arange([1,5]))
array([1, 3, 6])
>>> ak.cumsum(ak.uniform(5,1.0,5.0))
array([3.1598310770203937, 5.4110385860243131, 9.1622479306453748,
       12.710615785506533, 13.945880905466208])
>>> ak.cumsum(ak.randint(0, 1, 5, dtype=ak.bool_))
array([0, 1, 1, 2, 3])
arkouda.cumsum(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the cumulative sum over the array.

The sum is inclusive, such that the i th element of the result is the sum of elements up to and including i.

Parameters:

pda (pdarray)

Returns:

A pdarray containing cumulative sums for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.cumsum(ak.arange([1,5]))
array([1, 3, 6])
>>> ak.cumsum(ak.uniform(5,1.0,5.0))
array([3.1598310770203937, 5.4110385860243131, 9.1622479306453748,
       12.710615785506533, 13.945880905466208])
>>> ak.cumsum(ak.randint(0, 1, 5, dtype=ak.bool_))
array([0, 1, 1, 2, 3])
arkouda.cumsum(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the cumulative sum over the array.

The sum is inclusive, such that the i th element of the result is the sum of elements up to and including i.

Parameters:

pda (pdarray)

Returns:

A pdarray containing cumulative sums for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.cumsum(ak.arange([1,5]))
array([1, 3, 6])
>>> ak.cumsum(ak.uniform(5,1.0,5.0))
array([3.1598310770203937, 5.4110385860243131, 9.1622479306453748,
       12.710615785506533, 13.945880905466208])
>>> ak.cumsum(ak.randint(0, 1, 5, dtype=ak.bool_))
array([0, 1, 1, 2, 3])
arkouda.date_operators(cls)[source]
arkouda.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, inclusive='both', **kwargs)[source]

Creates a fixed frequency Datetime range. Alias for ak.Datetime(pd.date_range(args)). Subject to size limit imposed by client.maxTransferBytes.

Parameters:
  • start (str or datetime-like, optional) – Left bound for generating dates.

  • end (str or datetime-like, optional) – Right bound for generating dates.

  • periods (int, optional) – Number of periods to generate.

  • freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’. See timeseries.offset_aliases for a list of frequency aliases.

  • tz (str or tzinfo, optional) – Time zone name for returning localized DatetimeIndex, for example ‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is timezone-naive.

  • normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.

  • name (str, default None) – Name of the resulting DatetimeIndex.

  • closed ({None, 'left', 'right'}, optional) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None, the default). Deprecated

  • inclusive ({"both", "neither", "left", "right"}, default "both") – Include boundaries. Whether to set each bound as closed or open.

  • **kwargs – For compatibility. Has no effect on the result.

Returns:

rng

Return type:

DatetimeIndex

Notes

Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting DatetimeIndex will have periods linearly spaced elements between start and end (closed on both sides).

To learn more about the frequency strings, please see this link.

arkouda.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, inclusive='both', **kwargs)[source]

Creates a fixed frequency Datetime range. Alias for ak.Datetime(pd.date_range(args)). Subject to size limit imposed by client.maxTransferBytes.

Parameters:
  • start (str or datetime-like, optional) – Left bound for generating dates.

  • end (str or datetime-like, optional) – Right bound for generating dates.

  • periods (int, optional) – Number of periods to generate.

  • freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’. See timeseries.offset_aliases for a list of frequency aliases.

  • tz (str or tzinfo, optional) – Time zone name for returning localized DatetimeIndex, for example ‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is timezone-naive.

  • normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.

  • name (str, default None) – Name of the resulting DatetimeIndex.

  • closed ({None, 'left', 'right'}, optional) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None, the default). Deprecated

  • inclusive ({"both", "neither", "left", "right"}, default "both") – Include boundaries. Whether to set each bound as closed or open.

  • **kwargs – For compatibility. Has no effect on the result.

Returns:

rng

Return type:

DatetimeIndex

Notes

Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting DatetimeIndex will have periods linearly spaced elements between start and end (closed on both sides).

To learn more about the frequency strings, please see this link.

class arkouda.datetime64(value)

Bases: numpy.generic

If created from a 64-bit integer, it represents an offset from

1970-01-01T00:00:00. If created from string, the string can be in ISO 8601 date or datetime format.

>>> np.datetime64(10, 'Y')
numpy.datetime64('1980')
>>> np.datetime64('1980', 'Y')
numpy.datetime64('1980')
>>> np.datetime64(10, 'D')
numpy.datetime64('1970-01-11')

See arrays.datetime for more information.

Character code:

'M'

arkouda.deg2rad(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Converts angles element-wise from degrees to radians.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the corresponding value will be converted from degrees to radians. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing an angle converted to radians, from degrees, for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.delete(arr: arkouda.pdarrayclass.pdarray, obj: arkouda.pdarrayclass.pdarray | slice | int, axis: int | None = None) arkouda.pdarrayclass.pdarray[source]

Return a copy of ‘arr’ with elements along the specified axis removed.

Parameters:
  • arr (pdarray) – The array to remove elements from

  • obj (Union[pdarray, slice, int]) – The indices to remove from ‘arr’. If obj is a pdarray, it must have an integer dtype.

  • axis (Optional[int], optional) – The axis along which to remove elements. If None, the array will be flattened before removing elements. Defaults to None.

Returns:

A copy of ‘arr’ with elements removed

Return type:

pdarray

arkouda.deprecate(*args, **kwargs)[source]

Issues a DeprecationWarning, adds warning to old_name’s docstring, rebinds old_name.__name__ and returns the new function object.

This function may also be used as a decorator.

Parameters:
  • func (function) – The function to be deprecated.

  • old_name (str, optional) – The name of the function to be deprecated. Default is None, in which case the name of func is used.

  • new_name (str, optional) – The new name for the function. Default is None, in which case the deprecation message is that old_name is deprecated. If given, the deprecation message is that old_name is deprecated and new_name should be used instead.

  • message (str, optional) – Additional explanation of the deprecation. Displayed in the docstring after the warning.

Returns:

old_func – The deprecated function.

Return type:

function

Examples

Note that olduint returns a value after printing Deprecation Warning:

>>> olduint = np.deprecate(np.uint)
DeprecationWarning: `uint64` is deprecated! # may vary
>>> olduint(6)
6
arkouda.deprecate_with_doc(msg)[source]

Deprecates a function and includes the deprecation in its docstring.

This function is used as a decorator. It returns an object that can be used to issue a DeprecationWarning, by passing the to-be decorated function as argument, this adds warning to the to-be decorated function’s docstring and returns the new function object.

See also

deprecate

Decorate a function such that it issues a DeprecationWarning

Parameters:

msg (str) – Additional explanation of the deprecation. Displayed in the docstring after the warning.

Returns:

obj

Return type:

object

arkouda.disableVerbose(logLevel: LogLevel = LogLevel.INFO) None[source]

Disables verbose logging (DEBUG log level) for all ArkoudaLoggers, setting the log level for each to the logLevel parameter

Parameters:

logLevel (LogLevel) – The new log level, defaultts to LogLevel.INFO

Raises:

TypeError – Raised if logLevel is not a LogLevel enum

arkouda.disp(mesg, device=None, linefeed=True)[source]

Display a message on a device.

Parameters:
  • mesg (str) – Message to display.

  • device (object) – Device to write message. If None, defaults to sys.stdout which is very similar to print. device needs to have write() and flush() methods.

  • linefeed (bool, optional) – Option whether to print a line feed or not. Defaults to True.

Raises:

AttributeError – If device does not have a write() or flush() method.

Examples

Besides sys.stdout, a file-like object can also be used as it has both required methods:

>>> from io import StringIO
>>> buf = StringIO()
>>> np.disp(u'"Display" in a file', device=buf)
>>> buf.getvalue()
'"Display" in a file\n'
arkouda.divmod(x: arkouda.numpy.dtypes.numeric_scalars | pdarray, y: arkouda.numpy.dtypes.numeric_scalars | pdarray, where: arkouda.numpy.dtypes.bool_scalars | pdarray = True) Tuple[pdarray, pdarray][source]
Parameters:
  • x (numeric_scalars(float_scalars, int_scalars) or pdarray) – The dividend array, the values that will be the numerator of the floordivision and will be acted on by the bases for modular division.

  • y (numeric_scalars(float_scalars, int_scalars) or pdarray) – The divisor array, the values that will be the denominator of the division and will be the bases for the modular division.

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the corresponding value will be divided using floor and modular division. Elsewhere, it will retain its original value. Default set to True.

Returns:

Returns a tuple that contains quotient and remainder of the division

Return type:

(pdarray, pdarray)

Raises:
  • TypeError – At least one entry must be a pdarray

  • ValueError – If both inputs are both pdarrays, their size must match

  • ZeroDivisionError – No entry in y is allowed to be 0, to prevent division by zero

Notes

The div is calculated by x // y The mod is calculated by x % y

Examples

>>> x = ak.arange(5, 10)
>>> y = ak.array([2, 1, 4, 5, 8])
>>> ak.divmod(x,y)
(array([2 6 1 1 1]), array([1 0 3 3 1]))
>>> ak.divmod(x,y, x % 2 == 0)
(array([5 6 7 1 9]), array([5 0 7 3 9]))
arkouda.dot(pda1: numpy.int64 | numpy.float64 | numpy.uint64 | pdarray, pda2: numpy.int64 | numpy.float64 | numpy.uint64 | pdarray) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Returns the sum of the elementwise product of two arrays of the same size (the dot product) or the product of a singleton element and an array.

Parameters:
Returns:

The sum of the elementwise product pda1 and pda2 or the product of a singleton element and an array.

Return type:

Union[numeric_scalars, pdarray]

Raises:

ValueError – Raised if the size of pda1 is not the same as pda2

Examples

>>> x = ak.array([2, 3])
>>> y = ak.array([4, 5])
>>> ak.dot(x,y)
23
>>> ak.dot(x,2)
array([4 6])
class arkouda.double(value)

Bases: numpy.floating

Double-precision floating-point number type, compatible with Python float

and C double.

Character code:

'd'

Canonical name:

numpy.double

Alias:

numpy.float_

Alias on this platform (Linux x86_64):

numpy.float64: 64-bit precision floating-point number type: sign bit, 11 bits exponent, 52 bits mantissa.

as_integer_ratio(*args, **kwargs)

double.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.double(10.0).as_integer_ratio()
(10, 1)
>>> np.double(0.0).as_integer_ratio()
(0, 1)
>>> np.double(-.25).as_integer_ratio()
(-1, 4)
fromhex(string, /)

Create a floating-point number from a hexadecimal string.

>>> float.fromhex('0x1.ffffp10')
2047.984375
>>> float.fromhex('-0x1p-1074')
-5e-324
hex(/)

Return a hexadecimal representation of a floating-point number.

>>> (-0.1).hex()
'-0x1.999999999999ap-4'
>>> 3.14159.hex()
'0x1.921f9f01b866ep+1'
is_integer(*args, **kwargs)

double.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.double(-2.0).is_integer()
True
>>> np.double(3.2).is_integer()
False
arkouda.dtype(x)[source]
arkouda.e: float
arkouda.enableVerbose() None[source]

Enables verbose logging (DEBUG log level) for all ArkoudaLoggers

arkouda.euler_gamma: float
arkouda.exp(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise exponential of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing exponential values of the input array elements

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.exp(ak.arange(1,5))
array([2.7182818284590451, 7.3890560989306504, 20.085536923187668, 54.598150033144236])
>>> ak.exp(ak.uniform(5,1.0,5.0))
array([11.84010843172504, 46.454368507659211, 5.5571769623557188,
       33.494295836924771, 13.478894913238722])
arkouda.expm1(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise exponential of the array minus one.

Parameters:

pda (pdarray)

Returns:

A pdarray containing exponential values of the input array elements minus one

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.exp1m(ak.arange(1,5))
array([1.7182818284590451, 6.3890560989306504, 19.085536923187668, 53.598150033144236])
>>> ak.exp1m(ak.uniform(5,1.0,5.0))
array([10.84010843172504, 45.454368507659211, 4.5571769623557188,
       32.494295836924771, 12.478894913238722])
arkouda.export(read_path: str, dataset_name: str = 'ak_data', write_file: str | None = None, return_obj: bool = True, index: bool = False)[source]

Export data from Arkouda file (Parquet/HDF5) to Pandas object or file formatted to be readable by Pandas

Parameters:
  • read_path (str) – path to file where arkouda data is stored.

  • dataset_name (str) – name to store dataset under

  • index (bool) – Default False. When True, maintain the indexes loaded from the pandas file

  • write_file (str, optional) – path to file to write pandas formatted data to. Only write the file if this is set

  • return_obj (bool, optional) – Default True. When True return the Pandas DataFrame object, otherwise return None

Raises:

RuntimeError

  • Unsupported file type

Returns:

When return_obj=True

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.import_data

Notes

  • If Arkouda file is exported for pandas, the format will not change. This mean parquet files will remain parquet and hdf5 will remain hdf5.

  • Export can only be performed from hdf5 or parquet files written by Arkouda. The result will be the same file type, but formatted to be read by Pandas.

arkouda.eye(rows: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64, cols: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64, diag: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 = 0, dt: type = numpy.int64)[source]

Return a pdarray with zeros everywhere except along a diagonal, which is all ones. The matrix need not be square.

Parameters:
  • rows (int_scalars)

  • cols (int_scalars)

  • diag (int_scalars) – if diag = 0, zeros start at element [0,0] and proceed along diagonal if diag > 0, zeros start at element [0,diag] and proceed along diagonal if diag < 0, zeros start at element [diag,0] and proceed along diagonal etc.

Returns:

an array of zeros with ones along the specified diagonal

Return type:

pdarray

Examples

>>> ak.eye(rows=4,cols=4,diag=0,dt=ak.int64)
array([array([1 0 0 0]) array([0 1 0 0]) array([0 0 1 0]) array([0 0 0 1])])
>>> ak.eye(rows=3,cols=3,diag=1,dt=ak.float64)
array([array([0.00000000000000000 1.00000000000000000 0.00000000000000000])
array([0.00000000000000000  0.00000000000000000 1.00000000000000000])
array([0.00000000000000000 0.00000000000000000 0.00000000000000000])])
>>> ak.eye(rows=4,cols=4,diag=-1,dt=ak.bool_)
array([array([False False False False]) array([True False False False])
array([False True False False]) array([False False True False])]

Notes

if rows = cols and diag = 0, the result is an identity matrix Server returns an error if rank of pda < 2

arkouda.find(query, space, all_occurrences=False, remove_missing=False)[source]

Return indices of query items in a search list of items.

Parameters:
  • query ((sequence of) array-like) – The items to search for. If multiple arrays, each “row” is an item.

  • space ((sequence of) array-like) – The set of items in which to search. Must have same shape/dtype as query.

  • all_occurrences (bool) – When duplicate terms are present in search space, if all_occurrences is True, return all occurrences found as a SegArray, otherwise return only the first occurrences as a pdarray. Defaults to only finding the first occurrence. Finding all occurrences is not yet supported on sequences of arrays

  • remove_missing (bool) – If all_occurrences is True, remove_missing is automatically enabled. If False, return -1 for any items in query not found in space. If True, remove these and only return indices of items that are found.

Returns:

indices – For each item in query, its index in space. If all_occurrences is False, the return will be a pdarray of the first index where each value in the query appears in the space. If all_occurrences is True, the return will be a SegArray containing every index where each value in the query appears in the space. If all_occurrences is True, remove_missing is automatically enabled. If remove_missing is True, exclude missing values, otherwise return -1.

Return type:

pdarray or SegArray

Examples

>>> select_from = ak.arange(10)
>>> arr1 = select_from[ak.randint(0, select_from.size, 20, seed=10)]
>>> arr2 = select_from[ak.randint(0, select_from.size, 20, seed=11)]
# remove some values to ensure we have some values
# which don't appear in the search space
>>> arr2 = arr2[arr2 != 9]
>>> arr2 = arr2[arr2 != 3]

# find with defaults (all_occurrences and remove_missing both False) >>> ak.find(arr1, arr2) array([-1 -1 -1 0 1 -1 -1 -1 2 -1 5 -1 8 -1 5 -1 -1 11 5 0])

# set remove_missing to True, only difference from default # is missing values are excluded >>> ak.find(arr1, arr2, remove_missing=True)

array([0 1 2 5 8 5 11 5 0])

# set both remove_missing and all_occurrences to True, missing values # will be empty segments >>> ak.find(arr1, arr2, remove_missing=True, all_occurrences=True).to_list() [[],

[], [], [0, 4], [1, 3, 10], [], [], [], [2, 6, 12, 13], [], [5, 7], [], [8, 9, 14], [], [5, 7], [], [], [11, 15], [5, 7], [0, 4]]

class arkouda.finfo

finfo(dtype)

Machine limits for floating point types.

bits

The number of bits occupied by the type.

Type:

int

dtype

Returns the dtype for which finfo returns information. For complex input, the returned dtype is the associated float* dtype for its real and complex components.

Type:

dtype

eps

The difference between 1.0 and the next smallest representable float larger than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, eps = 2**-52, approximately 2.22e-16.

Type:

float

epsneg

The difference between 1.0 and the next smallest representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, epsneg = 2**-53, approximately 1.11e-16.

Type:

float

iexp

The number of bits in the exponent portion of the floating point representation.

Type:

int

machep

The exponent that yields eps.

Type:

int

max

The largest representable number.

Type:

floating point number of the appropriate type

maxexp

The smallest positive power of the base (2) that causes overflow.

Type:

int

min

The smallest representable number, typically -max.

Type:

floating point number of the appropriate type

minexp

The most negative power of the base (2) consistent with there being no leading 0’s in the mantissa.

Type:

int

negep

The exponent that yields epsneg.

Type:

int

nexp

The number of bits in the exponent including its sign and bias.

Type:

int

nmant

The number of bits in the mantissa.

Type:

int

precision

The approximate number of decimal digits to which this kind of float is precise.

Type:

int

resolution

The approximate decimal resolution of this type, i.e., 10**-precision.

Type:

floating point number of the appropriate type

tiny

An alias for smallest_normal, kept for backwards compatibility.

Type:

float

smallest_normal

The smallest positive floating point number with 1 as leading bit in the mantissa following IEEE-754 (see Notes).

Type:

float

smallest_subnormal

The smallest positive floating point number with 0 as leading bit in the mantissa following IEEE-754.

Type:

float

Parameters:

dtype (float, dtype, or instance) – Kind of floating point or complex floating point data-type about which to get information.

See also

iinfo

The equivalent for integer data types.

spacing

The distance between a value and the nearest adjacent number

nextafter

The next floating point value after x1 towards x2

Notes

For developers of NumPy: do not instantiate this at the module level. The initial calculation of these parameters is expensive and negatively impacts import times. These objects are cached, so calling finfo() repeatedly inside your functions is not a problem.

Note that smallest_normal is not actually the smallest positive representable value in a NumPy floating point type. As in the IEEE-754 standard [1]_, NumPy floating point types make use of subnormal numbers to fill the gap between 0 and smallest_normal. However, subnormal numbers may have significantly reduced precision [2].

This function can also be used for complex data types as well. If used, the output will be the same as the corresponding real float type (e.g. numpy.finfo(numpy.csingle) is the same as numpy.finfo(numpy.single)). However, the output is true for the real and imaginary components.

References

Examples

>>> np.finfo(np.float64).dtype
dtype('float64')
>>> np.finfo(np.complex64).dtype
dtype('float32')
property smallest_normal

Return the value for the smallest normal.

Returns:

smallest_normal – Value for the smallest normal.

Return type:

float

Warns:

UserWarning – If the calculated value for the smallest normal is requested for double-double.

property tiny

Return the value for tiny, alias of smallest_normal.

Returns:

tiny – Value for the smallest normal, alias of smallest_normal.

Return type:

float

Warns:

UserWarning – If the calculated value for the smallest normal is requested for double-double.

class arkouda.flexible(value)

Bases: numpy.generic

Abstract base class of all scalar types without predefined length.

The actual size of these types depends on the specific np.dtype instantiation.

arkouda.flip(x: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical, /, *, axis: int | Tuple[int, Ellipsis] | NoneType = None) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.categorical.Categorical[source]

Reverse an array’s values along a particular axis or axes.

Parameters:
  • x (pdarray, Strings, or Categorical) –

    Reverse the order of elements in an array along the given axis.

    The shape of the array is preserved, but the elements are reordered.

  • axis (int or Tuple[int, ...], optional) – The axis or axes along which to flip the array. If None, flip the array along all axes.

Returns:

An array with the entries of axis reversed.

Return type:

pdarray, Strings, or Categorical

Note

This differs from numpy as it actually reverses the data, rather than presenting a view.

class arkouda.float16(value)

Bases: numpy.floating

Half-precision floating-point number type.

Character code:

'e'

Canonical name:

numpy.half

Alias on this platform (Linux x86_64):

numpy.float16: 16-bit-precision floating-point number type: sign bit, 5 bits exponent, 10 bits mantissa.

as_integer_ratio(*args, **kwargs)

half.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.half(10.0).as_integer_ratio()
(10, 1)
>>> np.half(0.0).as_integer_ratio()
(0, 1)
>>> np.half(-.25).as_integer_ratio()
(-1, 4)
is_integer(*args, **kwargs)

half.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.half(-2.0).is_integer()
True
>>> np.half(3.2).is_integer()
False
class arkouda.float32(value)

Bases: numpy.floating

Single-precision floating-point number type, compatible with C float.

Character code:

'f'

Canonical name:

numpy.single

Alias on this platform (Linux x86_64):

numpy.float32: 32-bit-precision floating-point number type: sign bit, 8 bits exponent, 23 bits mantissa.

as_integer_ratio(*args, **kwargs)

single.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.single(10.0).as_integer_ratio()
(10, 1)
>>> np.single(0.0).as_integer_ratio()
(0, 1)
>>> np.single(-.25).as_integer_ratio()
(-1, 4)
is_integer(*args, **kwargs)

single.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.single(-2.0).is_integer()
True
>>> np.single(3.2).is_integer()
False
class arkouda.float64(value)

Bases: numpy.floating

Double-precision floating-point number type, compatible with Python float

and C double.

Character code:

'd'

Canonical name:

numpy.double

Alias:

numpy.float_

Alias on this platform (Linux x86_64):

numpy.float64: 64-bit precision floating-point number type: sign bit, 11 bits exponent, 52 bits mantissa.

as_integer_ratio(*args, **kwargs)

double.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.double(10.0).as_integer_ratio()
(10, 1)
>>> np.double(0.0).as_integer_ratio()
(0, 1)
>>> np.double(-.25).as_integer_ratio()
(-1, 4)
fromhex(string, /)

Create a floating-point number from a hexadecimal string.

>>> float.fromhex('0x1.ffffp10')
2047.984375
>>> float.fromhex('-0x1p-1074')
-5e-324
hex(/)

Return a hexadecimal representation of a floating-point number.

>>> (-0.1).hex()
'-0x1.999999999999ap-4'
>>> 3.14159.hex()
'0x1.921f9f01b866ep+1'
is_integer(*args, **kwargs)

double.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.double(-2.0).is_integer()
True
>>> np.double(3.2).is_integer()
False
class arkouda.float_(value)

Bases: numpy.floating

Double-precision floating-point number type, compatible with Python float

and C double.

Character code:

'd'

Canonical name:

numpy.double

Alias:

numpy.float_

Alias on this platform (Linux x86_64):

numpy.float64: 64-bit precision floating-point number type: sign bit, 11 bits exponent, 52 bits mantissa.

as_integer_ratio(*args, **kwargs)

double.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.double(10.0).as_integer_ratio()
(10, 1)
>>> np.double(0.0).as_integer_ratio()
(0, 1)
>>> np.double(-.25).as_integer_ratio()
(-1, 4)
fromhex(string, /)

Create a floating-point number from a hexadecimal string.

>>> float.fromhex('0x1.ffffp10')
2047.984375
>>> float.fromhex('-0x1p-1074')
-5e-324
hex(/)

Return a hexadecimal representation of a floating-point number.

>>> (-0.1).hex()
'-0x1.999999999999ap-4'
>>> 3.14159.hex()
'0x1.921f9f01b866ep+1'
is_integer(*args, **kwargs)

double.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.double(-2.0).is_integer()
True
>>> np.double(3.2).is_integer()
False
class arkouda.float_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

class arkouda.floating(value)

Bases: numpy.inexact

Abstract base class of all floating-point scalar types.

arkouda.floor(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise floor of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing floor values of the input array elements

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.floor(ak.linspace(1.1,5.5,5))
array([1, 2, 3, 4, 5])
arkouda.fmod(dividend: pdarray | arkouda.numpy.dtypes.numeric_scalars, divisor: pdarray | arkouda.numpy.dtypes.numeric_scalars) pdarray[source]

Returns the element-wise remainder of division.

It is equivalent to np.fmod, the remainder has the same sign as the dividend.

Parameters:
  • dividend (numeric scalars or pdarray) – The array being acted on by the bases for the modular division.

  • divisor (numeric scalars or pdarray) – The array that will be the bases for the modular division.

Returns:

Returns an array that contains the element-wise remainder of division.

Return type:

pdarray

arkouda.format_float_positional(x, precision=None, unique=True, fractional=True, trim='k', sign=False, pad_left=None, pad_right=None, min_digits=None)

Format a floating-point scalar as a decimal string in positional notation.

Provides control over rounding, trimming and padding. Uses and assumes IEEE unbiased rounding. Uses the “Dragon4” algorithm.

Parameters:
  • x (python float or numpy floating scalar) – Value to format.

  • precision (non-negative integer or None, optional) – Maximum number of digits to print. May be None if unique is True, but must be an integer if unique is False.

  • unique (boolean, optional) – If True, use a digit-generation strategy which gives the shortest representation which uniquely identifies the floating-point number from other values of the same type, by judicious rounding. If precision is given fewer digits than necessary can be printed, or if min_digits is given more can be printed, in which cases the last digit is rounded with unbiased rounding. If False, digits are generated as if printing an infinite-precision value and stopping after precision digits, rounding the remaining value with unbiased rounding

  • fractional (boolean, optional) – If True, the cutoffs of precision and min_digits refer to the total number of digits after the decimal point, including leading zeros. If False, precision and min_digits refer to the total number of significant digits, before or after the decimal point, ignoring leading zeros.

  • trim (one of 'k', '.', '0', '-', optional) –

    Controls post-processing trimming of trailing digits, as follows:

    • ’k’ : keep trailing zeros, keep decimal point (no trimming)

    • ’.’ : trim all trailing zeros, leave decimal point

    • ’0’ : trim all but the zero before the decimal point. Insert the zero if it is missing.

    • ’-’ : trim trailing zeros and any trailing decimal point

  • sign (boolean, optional) – Whether to show the sign for positive values.

  • pad_left (non-negative integer, optional) – Pad the left side of the string with whitespace until at least that many characters are to the left of the decimal point.

  • pad_right (non-negative integer, optional) – Pad the right side of the string with whitespace until at least that many characters are to the right of the decimal point.

  • min_digits (non-negative integer or None, optional) –

    Minimum number of digits to print. Only has an effect if unique=True in which case additional digits past those necessary to uniquely identify the value may be printed, rounding the last additional digit.

    – versionadded:: 1.21.0

Returns:

rep – The string representation of the floating point value

Return type:

string

Examples

>>> np.format_float_positional(np.float32(np.pi))
'3.1415927'
>>> np.format_float_positional(np.float16(np.pi))
'3.14'
>>> np.format_float_positional(np.float16(0.3))
'0.3'
>>> np.format_float_positional(np.float16(0.3), unique=False, precision=10)
'0.3000488281'
arkouda.format_float_scientific(x, precision=None, unique=True, trim='k', sign=False, pad_left=None, exp_digits=None, min_digits=None)

Format a floating-point scalar as a decimal string in scientific notation.

Provides control over rounding, trimming and padding. Uses and assumes IEEE unbiased rounding. Uses the “Dragon4” algorithm.

Parameters:
  • x (python float or numpy floating scalar) – Value to format.

  • precision (non-negative integer or None, optional) – Maximum number of digits to print. May be None if unique is True, but must be an integer if unique is False.

  • unique (boolean, optional) – If True, use a digit-generation strategy which gives the shortest representation which uniquely identifies the floating-point number from other values of the same type, by judicious rounding. If precision is given fewer digits than necessary can be printed. If min_digits is given more can be printed, in which cases the last digit is rounded with unbiased rounding. If False, digits are generated as if printing an infinite-precision value and stopping after precision digits, rounding the remaining value with unbiased rounding

  • trim (one of 'k', '.', '0', '-', optional) –

    Controls post-processing trimming of trailing digits, as follows:

    • ’k’ : keep trailing zeros, keep decimal point (no trimming)

    • ’.’ : trim all trailing zeros, leave decimal point

    • ’0’ : trim all but the zero before the decimal point. Insert the zero if it is missing.

    • ’-’ : trim trailing zeros and any trailing decimal point

  • sign (boolean, optional) – Whether to show the sign for positive values.

  • pad_left (non-negative integer, optional) – Pad the left side of the string with whitespace until at least that many characters are to the left of the decimal point.

  • exp_digits (non-negative integer, optional) – Pad the exponent with zeros until it contains at least this many digits. If omitted, the exponent will be at least 2 digits.

  • min_digits (non-negative integer or None, optional) –

    Minimum number of digits to print. This only has an effect for unique=True. In that case more digits than necessary to uniquely identify the value may be printed and rounded unbiased.

    – versionadded:: 1.21.0

Returns:

rep – The string representation of the floating point value

Return type:

string

Examples

>>> np.format_float_scientific(np.float32(np.pi))
'3.1415927e+00'
>>> s = np.float32(1.23e24)
>>> np.format_float_scientific(s, unique=False, precision=15)
'1.230000071797338e+24'
>>> np.format_float_scientific(s, exp_digits=4)
'1.23e+0024'
class arkouda.format_parser

Class to convert formats, names, titles description to a dtype.

After constructing the format_parser object, the dtype attribute is the converted data-type: dtype = format_parser(formats, names, titles).dtype

dtype

The converted data-type.

Type:

dtype

Parameters:
  • formats (str or list of str) – The format description, either specified as a string with comma-separated format descriptions in the form 'f8, i4, a5', or a list of format description strings in the form ['f8', 'i4', 'a5'].

  • names (str or list/tuple of str) – The field names, either specified as a comma-separated string in the form 'col1, col2, col3', or as a list or tuple of strings in the form ['col1', 'col2', 'col3']. An empty list can be used, in that case default field names (‘f0’, ‘f1’, …) are used.

  • titles (sequence) – Sequence of title strings. An empty list can be used to leave titles out.

  • aligned (bool, optional) – If True, align the fields by padding as the C-compiler would. Default is False.

  • byteorder (str, optional) – If specified, all the fields will be changed to the provided byte-order. Otherwise, the default byte-order is used. For all available string specifiers, see dtype.newbyteorder.

See also

dtype, typename, sctype2char

Examples

>>> np.format_parser(['<f8', '<i4', '<a5'], ['col1', 'col2', 'col3'],
...                  ['T1', 'T2', 'T3']).dtype
dtype([(('T1', 'col1'), '<f8'), (('T2', 'col2'), '<i4'), (('T3', 'col3'), 'S5')])

names and/or titles can be empty lists. If titles is an empty list, titles will simply not appear. If names is empty, default field names will be used.

>>> np.format_parser(['f8', 'i4', 'a5'], ['col1', 'col2', 'col3'],
...                  []).dtype
dtype([('col1', '<f8'), ('col2', '<i4'), ('col3', '<S5')])
>>> np.format_parser(['<f8', '<i4', '<a5'], [], []).dtype
dtype([('f0', '<f8'), ('f1', '<i4'), ('f2', 'S5')])
arkouda.gen_ranges(starts, ends, stride=1, return_lengths=False)[source]

Generate a segmented array of variable-length, contiguous ranges between pairs of start- and end-points.

Parameters:
  • starts (pdarray, int64) – The start value of each range

  • ends (pdarray, int64) – The end value (exclusive) of each range

  • stride (int) – Difference between successive elements of each range

  • return_lengths (bool, optional) – Whether or not to return the lengths of each segment. Default False.

Returns:

  • segments (pdarray, int64) – The starting index of each range in the resulting array

  • ranges (pdarray, int64) – The actual ranges, flattened into a single array

  • lengths (pdarray, int64) – The lengths of each segment. Only returned if return_lengths=True.

arkouda.gen_ranges(starts, ends, stride=1, return_lengths=False)[source]

Generate a segmented array of variable-length, contiguous ranges between pairs of start- and end-points.

Parameters:
  • starts (pdarray, int64) – The start value of each range

  • ends (pdarray, int64) – The end value (exclusive) of each range

  • stride (int) – Difference between successive elements of each range

  • return_lengths (bool, optional) – Whether or not to return the lengths of each segment. Default False.

Returns:

  • segments (pdarray, int64) – The starting index of each range in the resulting array

  • ranges (pdarray, int64) – The actual ranges, flattened into a single array

  • lengths (pdarray, int64) – The lengths of each segment. Only returned if return_lengths=True.

arkouda.generic_concat(items, ordered=True)[source]
arkouda.getArkoudaLogger(name: str, handlers: List[logging.Handler] | None = None, logFormat: str | None = ArkoudaLogger.DEFAULT_LOG_FORMAT, logLevel: LogLevel | None = None) ArkoudaLogger[source]

A convenience method for instantiating an ArkoudaLogger that retrieves the logging level from the ARKOUDA_LOG_LEVEL env variable

Parameters:
  • name (str) – The name of the ArkoudaLogger

  • handlers (List[Handler]) – A list of logging.Handler objects, if None, a list consisting of one StreamHandler named ‘console-handler’ is generated and configured

  • logFormat (str) – The format for log messages, defaults to the following format: ‘[%(name)s] Line %(lineno)d %(levelname)s: %(message)s’

Return type:

ArkoudaLogger

Raises:

TypeError – Raised if either name or logFormat is not a str object or if handlers is not a list of str objects

Notes

Important note: if a list of 1..n logging.Handler objects is passed in, and dynamic changes to 1..n handlers is desired, set a name for each Handler object as follows: handler.name = <desired name>, which will enable retrieval and updates for the specified handler.

arkouda.get_byteorder(dt: np.dtype) str[source]

Get a concrete byteorder (turns ‘=’ into ‘<’ or ‘>’)

arkouda.get_callback(x)[source]
arkouda.get_columns(filenames: str | List[str], col_delim: str = ',', allow_errors: bool = False) List[str][source]

Get a list of column names from CSV file(s).

arkouda.get_datasets(filenames: str | List[str], allow_errors: bool = False, column_delim: str = ',', read_nested: bool = True) List[str][source]

Get the names of the datasets in the provide files

Parameters:
  • filenames (str or List[str]) – Name of the file/s from which to return datasets

  • allow_errors (bool) – Default: False Whether or not to allow errors while accessing datasets

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet Files.

Return type:

List[str] of names of the datasets

Raises:

RuntimeError

  • If no datasets are returned

Notes

  • This function currently supports HDF5 and Parquet formats.

  • Future updates to Parquet will deprecate this functionality on that format,

but similar support will be added for Parquet at that time. - If a list of files is provided, only the datasets in the first file will be returned

See also

ls

arkouda.get_filetype(filenames: str | List[str]) str[source]

Get the type of a file accessible to the server. Supported file types and possible return strings are ‘HDF5’ and ‘Parquet’.

Parameters:

filenames (Union[str, List[str]]) – A file or list of files visible to the arkouda server

Returns:

Type of the file returned as a string, either ‘HDF5’, ‘Parquet’ or ‘CSV

Return type:

str

Raises:

ValueError – Raised if filename is empty or contains only whitespace

Notes

  • When list provided, it is assumed that all files are the same type

  • CSV Files without the Arkouda Header are not supported

arkouda.get_null_indices(filenames: str | List[str], datasets: str | List[str] | None = None) arkouda.pdarrayclass.pdarray | Mapping[str, arkouda.pdarrayclass.pdarray][source]

Get null indices of a string column in a Parquet file.

Parameters:
  • filenames (list or str) – Either a list of filenames or shell expression

  • datasets (list or str or None) – (List of) name(s) of dataset(s) to read. Each dataset must be a string column. There is no default value for this function, the datasets to be read must be specified.

Returns:

Dictionary of {datasetName: pdarray}

Return type:

returns a dictionary of Arkouda pdarrays

Raises:
  • RuntimeError – Raised if one or more of the specified files cannot be opened.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

See also

get_datasets, ls

arkouda.get_server_byteorder() str[source]

Get the server’s byteorder

class arkouda.half(value)

Bases: numpy.floating

Half-precision floating-point number type.

Character code:

'e'

Canonical name:

numpy.half

Alias on this platform (Linux x86_64):

numpy.float16: 16-bit-precision floating-point number type: sign bit, 5 bits exponent, 10 bits mantissa.

as_integer_ratio(*args, **kwargs)

half.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.half(10.0).as_integer_ratio()
(10, 1)
>>> np.half(0.0).as_integer_ratio()
(0, 1)
>>> np.half(-.25).as_integer_ratio()
(-1, 4)
is_integer(*args, **kwargs)

half.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.half(-2.0).is_integer()
True
>>> np.half(3.2).is_integer()
False
arkouda.hash(pda: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | SegArray | Categorical | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | SegArray | Categorical], full: bool = True) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray] | arkouda.pdarrayclass.pdarray[source]

Return an element-wise hash of the array or list of arrays.

Parameters:
  • pda (Union[pdarray, Strings, Segarray, Categorical],) – List[Union[pdarray, Strings, Segarray, Categorical]]]

  • full (bool) – This is only used when a single pdarray is passed into hash By default, a 128-bit hash is computed and returned as two int64 arrays. If full=False, then a 64-bit hash is computed and returned as a single int64 array.

Returns:

If full=True or a list of pdarrays is passed, a 2-tuple of pdarrays containing the high and low 64 bits of each hash, respectively. If full=False and a single pdarray is passed, a single pdarray containing a 64-bit hash

Return type:

hashes

Raises:

TypeError – Raised if the parameter is not a pdarray

Notes

In the case of a single pdarray being passed, this function uses the SIPhash algorithm, which can output either a 64-bit or 128-bit hash. However, the 64-bit hash runs a significant risk of collisions when applied to more than a few million unique values. Unless the number of unique values is known to be small, the 128-bit hash is strongly recommended.

Note that this hash should not be used for security, or for any cryptographic application. Not only is SIPhash not intended for such uses, but this implementation employs a fixed key for the hash, which makes it possible for an adversary with control over input to engineer collisions.

In the case of a list of pdrrays, Strings, Categoricals, or Segarrays being passed, a non-linear function must be applied to each array since hashes of subsequent arrays cannot be simply XORed because equivalent values will cancel each other out, hence we do a rotation by the ordinal of the array.

arkouda.hist_all(ak_df: arkouda.dataframe.DataFrame, cols: list = [])[source]

Create a grid plot histogramming all numeric columns in ak dataframe

Parameters:
  • ak_df (ak.DataFrame) – Full Arkouda DataFrame containing data to be visualized

  • cols (list) – (Optional) A specified list of columns to be plotted

Notes

This function displays the plot.

Examples

>>> import arkouda as ak
>>> from arkouda.plotting import hist_all
>>> ak_df = ak.DataFrame({"a": ak.array(np.random.randn(100)),
                          "b": ak.array(np.random.randn(100)),
                          "c": ak.array(np.random.randn(100)),
                          "d": ak.array(np.random.randn(100))
                          })
>>> hist_all(ak_df)
arkouda.histogram(pda: arkouda.pdarrayclass.pdarray, bins: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 = 10) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a histogram of evenly spaced bins over the range of an array.

Parameters:
  • pda (pdarray) – The values to histogram

  • bins (int_scalars) – The number of equal-size bins to use (default: 10)

Returns:

Bin edges and The number of values present in each bin

Return type:

(pdarray, Union[pdarray, int64 or float64])

Raises:
  • TypeError – Raised if the parameter is not a pdarray or if bins is not an int.

  • ValueError – Raised if bins < 1

  • NotImplementedError – Raised if pdarray dtype is bool or uint8

Notes

The bins are evenly spaced in the interval [pda.min(), pda.max()].

Examples

>>> import matplotlib.pyplot as plt
>>> A = ak.arange(0, 10, 1)
>>> nbins = 3
>>> h, b = ak.histogram(A, bins=nbins)
>>> h
array([3, 3, 4])
>>> b
array([0., 3., 6., 9.])

# To plot, export the left edges and the histogram to NumPy >>> plt.plot(b.to_ndarray()[::-1], h.to_ndarray())

arkouda.histogram(pda: arkouda.pdarrayclass.pdarray, bins: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 = 10) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute a histogram of evenly spaced bins over the range of an array.

Parameters:
  • pda (pdarray) – The values to histogram

  • bins (int_scalars) – The number of equal-size bins to use (default: 10)

Returns:

Bin edges and The number of values present in each bin

Return type:

(pdarray, Union[pdarray, int64 or float64])

Raises:
  • TypeError – Raised if the parameter is not a pdarray or if bins is not an int.

  • ValueError – Raised if bins < 1

  • NotImplementedError – Raised if pdarray dtype is bool or uint8

Notes

The bins are evenly spaced in the interval [pda.min(), pda.max()].

Examples

>>> import matplotlib.pyplot as plt
>>> A = ak.arange(0, 10, 1)
>>> nbins = 3
>>> h, b = ak.histogram(A, bins=nbins)
>>> h
array([3, 3, 4])
>>> b
array([0., 3., 6., 9.])

# To plot, export the left edges and the histogram to NumPy >>> plt.plot(b.to_ndarray()[::-1], h.to_ndarray())

arkouda.histogram2d(x: arkouda.pdarrayclass.pdarray, y: arkouda.pdarrayclass.pdarray, bins: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | Sequence[int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64] = 10) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Compute the bi-dimensional histogram of two data samples with evenly spaced bins

Parameters:
  • x (pdarray) – A pdarray containing the x coordinates of the points to be histogrammed.

  • y (pdarray) – A pdarray containing the y coordinates of the points to be histogrammed.

  • bins (int_scalars or [int, int] = 10) – The number of equal-size bins to use. If int, the number of bins for the two dimensions (nx=ny=bins). If [int, int], the number of bins in each dimension (nx, ny = bins). Defaults to 10

Returns:

  • hist (pdarray) – shape(nx, ny) The bi-dimensional histogram of samples x and y. Values in x are histogrammed along the first dimension and values in y are histogrammed along the second dimension.

  • x_edges (pdarray) – The bin edges along the first dimension.

  • y_edges (pdarray) – The bin edges along the second dimension.

Raises:
  • TypeError – Raised if x or y parameters are not pdarrays or if bins is not an int or (int, int).

  • ValueError – Raised if bins < 1

  • NotImplementedError – Raised if pdarray dtype is bool or uint8

See also

histogram

Notes

The x bins are evenly spaced in the interval [x.min(), x.max()] and y bins are evenly spaced in the interval [y.min(), y.max()].

Examples

>>> x = ak.arange(0, 10, 1)
>>> y = ak.arange(9, -1, -1)
>>> nbins = 3
>>> h, x_edges, y_edges = ak.histogram2d(x, y, bins=nbins)
>>> h
array([[0, 0, 3],
       [0, 2, 1],
       [3, 1, 0]])
>>> x_edges
array([0.0 3.0 6.0 9.0])
>>> x_edges
array([0.0 3.0 6.0 9.0])
arkouda.histogramdd(sample: Sequence[arkouda.pdarrayclass.pdarray], bins: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | Sequence[int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64] = 10) Tuple[arkouda.pdarrayclass.pdarray, Sequence[arkouda.pdarrayclass.pdarray]][source]

Compute the multidimensional histogram of data in sample with evenly spaced bins.

Parameters:
  • sample (Sequence[pdarray]) – A sequence of pdarrays containing the coordinates of the points to be histogrammed.

  • bins (int_scalars or Sequence[int_scalars] = 10) – The number of equal-size bins to use. If int, the number of bins for all dimensions (nx=ny=…=bins). If [int, int, …], the number of bins in each dimension (nx, ny, … = bins). Defaults to 10

Returns:

  • hist (pdarray) – shape(nx, ny, …, nd) The multidimensional histogram of pdarrays in sample. Values in first pdarray are histogrammed along the first dimension. Values in second pdarray are histogrammed along the second dimension and so on.

  • edges (List[pdarray]) – A list of pdarrays containing the bin edges for each dimension.

Raises:
  • ValueError – Raised if bins < 1

  • NotImplementedError – Raised if pdarray dtype is bool or uint8

See also

histogram

Notes

The bins for each dimension, m, are evenly spaced in the interval [m.min(), m.max()]

Examples

>>> x = ak.arange(0, 10, 1)
>>> y = ak.arange(9, -1, -1)
>>> z = ak.where(x % 2 == 0, x, y)
>>> h, edges = ak.histogramdd((x, y,z), bins=(2,2,5))
>>> h
array([[[0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1]],
[[1, 1, 1, 1, 1],

[0, 0, 0, 0, 0]]])

>>> edges
[array([0.0 4.5 9.0]),
 array([0.0 4.5 9.0]),
 array([0.0 1.6 3.2 4.8 6.4 8.0])]
class arkouda.iinfo

iinfo(type)

Machine limits for integer types.

bits

The number of bits occupied by the type.

Type:

int

dtype

Returns the dtype for which iinfo returns information.

Type:

dtype

min

The smallest integer expressible by the type.

Type:

int

max

The largest integer expressible by the type.

Type:

int

Parameters:

int_type (integer type, dtype, or instance) – The kind of integer data type to get information about.

See also

finfo

The equivalent for floating point data types.

Examples

With types:

>>> ii16 = np.iinfo(np.int16)
>>> ii16.min
-32768
>>> ii16.max
32767
>>> ii32 = np.iinfo(np.int32)
>>> ii32.min
-2147483648
>>> ii32.max
2147483647

With instances:

>>> ii32 = np.iinfo(np.int32(10))
>>> ii32.min
-2147483648
>>> ii32.max
2147483647
property max

Maximum value of given dtype.

property min

Minimum value of given dtype.

arkouda.import_data(read_path: str, write_file: str | None = None, return_obj: bool = True, index: bool = False)[source]

Import data from a file saved by Pandas (HDF5/Parquet) to Arkouda object and/or a file formatted to be read by Arkouda.

Parameters:
  • read_path (str) – path to file where pandas data is stored. This can be glob expression for parquet formats.

  • write_file (str, optional) – path to file to write arkouda formatted data to. Only write file if provided

  • return_obj (bool, optional) – Default True. When True return the Arkouda DataFrame object, otherwise return None

  • index (bool, optional) – Default False. When True, maintain the indexes loaded from the pandas file

Raises:
  • RuntimeWarning

    • Export attempted on Parquet file. Arkouda formatted Parquet files are readable by pandas.

  • RuntimeError

    • Unsupported file type

Returns:

When return_obj=True

Return type:

pd.DataFrame

See also

pandas.DataFrame.to_parquet, pandas.DataFrame.to_hdf, pandas.DataFrame.read_parquet, pandas.DataFrame.read_hdf, ak.export

Notes

  • Import can only be performed from hdf5 or parquet files written by pandas.

arkouda.in1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False, symmetric: bool = False, invert: bool = False) arkouda.groupbyclass.groupable[source]

Test whether each element of a 1-D array is also present in a second array.

Returns a boolean array the same length as pda1 that is True where an element of pda1 is in pda2 and False otherwise.

Support multi-level – test membership of rows of a in the set of rows of b.

Parameters:
  • a (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements for which to test membership in b

  • b (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements of the set in which to test membership

  • assume_unique (bool) – If true, assume rows of a and b are each unique and sorted. By default, sort and unique them explicitly.

  • symmetric (bool) – Return in1d(pda1, pda2), in1d(pda2, pda1) when pda1 and 2 are single items.

  • invert (bool, optional) – If True, the values in the returned array are inverted (that is, False where an element of pda1 is in pda2 and True otherwise). Default is False. ak.in1d(a, b, invert=True) is equivalent to (but is faster than) ~ak.in1d(a, b).

Returns:

  • True for each row in a that is contained in b

  • Return Type

  • ———— – pdarray, bool

Notes

Only works for pdarrays of int64 dtype, float64, Strings, or Categorical

arkouda.in1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False, symmetric: bool = False, invert: bool = False) arkouda.groupbyclass.groupable[source]

Test whether each element of a 1-D array is also present in a second array.

Returns a boolean array the same length as pda1 that is True where an element of pda1 is in pda2 and False otherwise.

Support multi-level – test membership of rows of a in the set of rows of b.

Parameters:
  • a (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements for which to test membership in b

  • b (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements of the set in which to test membership

  • assume_unique (bool) – If true, assume rows of a and b are each unique and sorted. By default, sort and unique them explicitly.

  • symmetric (bool) – Return in1d(pda1, pda2), in1d(pda2, pda1) when pda1 and 2 are single items.

  • invert (bool, optional) – If True, the values in the returned array are inverted (that is, False where an element of pda1 is in pda2 and True otherwise). Default is False. ak.in1d(a, b, invert=True) is equivalent to (but is faster than) ~ak.in1d(a, b).

Returns:

  • True for each row in a that is contained in b

  • Return Type

  • ———— – pdarray, bool

Notes

Only works for pdarrays of int64 dtype, float64, Strings, or Categorical

arkouda.in1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False, symmetric: bool = False, invert: bool = False) arkouda.groupbyclass.groupable[source]

Test whether each element of a 1-D array is also present in a second array.

Returns a boolean array the same length as pda1 that is True where an element of pda1 is in pda2 and False otherwise.

Support multi-level – test membership of rows of a in the set of rows of b.

Parameters:
  • a (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements for which to test membership in b

  • b (list of pdarrays, pdarray, Strings, or Categorical) – Rows are elements of the set in which to test membership

  • assume_unique (bool) – If true, assume rows of a and b are each unique and sorted. By default, sort and unique them explicitly.

  • symmetric (bool) – Return in1d(pda1, pda2), in1d(pda2, pda1) when pda1 and 2 are single items.

  • invert (bool, optional) – If True, the values in the returned array are inverted (that is, False where an element of pda1 is in pda2 and True otherwise). Default is False. ak.in1d(a, b, invert=True) is equivalent to (but is faster than) ~ak.in1d(a, b).

Returns:

  • True for each row in a that is contained in b

  • Return Type

  • ———— – pdarray, bool

Notes

Only works for pdarrays of int64 dtype, float64, Strings, or Categorical

arkouda.in1d_intervals(vals, intervals, symmetric=False)[source]

Test each value for membership in any of a set of half-open (pythonic) intervals.

Parameters:
  • vals (pdarray(int, float)) – Values to test for membership in intervals

  • intervals (2-tuple of pdarrays) – Non-overlapping, half-open intervals, as a tuple of (lower_bounds_inclusive, upper_bounds_exclusive)

  • symmetric (bool) – If True, also return boolean pdarray indicating which intervals contained one or more query values.

Returns:

  • pdarray(bool) – Array of same length as <vals>, True if corresponding value is included in any of the ranges defined by (low[i], high[i]) inclusive.

  • pdarray(bool) (if symmetric=True) – Array of same length as number of intervals, True if corresponding interval contains any of the values in <vals>.

Notes

First return array is equivalent to the following:

((vals >= intervals[0][0]) & (vals < intervals[1][0])) | ((vals >= intervals[0][1]) & (vals < intervals[1][1])) | … ((vals >= intervals[0][-1]) & (vals < intervals[1][-1]))

But much faster when testing many ranges.

Second (optional) return array is equivalent to:

((intervals[0] <= vals[0]) & (intervals[1] > vals[0])) | ((intervals[0] <= vals[1]) & (intervals[1] > vals[1])) | … ((intervals[0] <= vals[-1]) & (intervals[1] > vals[-1]))

But much faster when vals is non-trivial size.

arkouda.indexof1d(query: arkouda.groupbyclass.groupable, space: arkouda.groupbyclass.groupable) arkouda.pdarrayclass.pdarray[source]

Return indices of query items in a search list of items. Items not found will be excluded. When duplicate terms are present in search space return indices of all occurrences.

Parameters:
  • query ((sequence of) pdarray or Strings or Categorical) – The items to search for. If multiple arrays, each “row” is an item.

  • space ((sequence of) pdarray or Strings or Categorical) – The set of items in which to search. Must have same shape/dtype as query.

Returns:

indices – For each item in query, its index in space.

Return type:

pdarray, int64

Notes

This is an alias of ak.find(query, space, all_occurrences=True, remove_missing=True).values

Examples

>>> select_from = ak.arange(10)
>>> arr1 = select_from[ak.randint(0, select_from.size, 20, seed=10)]
>>> arr2 = select_from[ak.randint(0, select_from.size, 20, seed=11)]
# remove some values to ensure we have some values
# which don't appear in the search space
>>> arr2 = arr2[arr2 != 9]
>>> arr2 = arr2[arr2 != 3]
>>> ak.indexof1d(arr1, arr2)
array([0 4 1 3 10 2 6 12 13 5 7 8 9 14 5 7 11 15 5 7 0 4])
Raises:
  • TypeError – Raised if either keys or arr is not a pdarray, Strings, or Categorical object

  • RuntimeError – Raised if the dtype of either array is not supported

class arkouda.inexact(value)

Bases: numpy.number

Abstract base class of all numeric scalar types with a (potentially)

inexact representation of the values in its range, such as floating-point numbers.

arkouda.inf: float
arkouda.information(names: List[str] | str = RegisteredSymbols) str[source]

Returns JSON formatted string containing information about the objects in names

Parameters:

names (Union[List[str], str]) – names is either the name of an object or list of names of objects to retrieve info if names is ak.AllSymbols, retrieves info for all symbols in the symbol table if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry

Returns:

JSON formatted string containing a list of information for each object in names

Return type:

str

Raises:

RuntimeError – Raised if a server-side error is thrown in the process of retrieving information about the objects in names

arkouda.infty: float
class arkouda.int16(value)

Bases: numpy.signedinteger

Signed integer type, compatible with C short.

Character code:

'h'

Canonical name:

numpy.short

Alias on this platform (Linux x86_64):

numpy.int16: 16-bit signed integer (-32_768 to 32_767).

bit_count(*args, **kwargs)

int16.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int16(127).bit_count()
7
>>> np.int16(-127).bit_count()
7
class arkouda.int32(value)

Bases: numpy.signedinteger

Signed integer type, compatible with C int.

Character code:

'i'

Canonical name:

numpy.intc

Alias on this platform (Linux x86_64):

numpy.int32: 32-bit signed integer (-2_147_483_648 to 2_147_483_647).

bit_count(*args, **kwargs)

int32.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int32(127).bit_count()
7
>>> np.int32(-127).bit_count()
7
class arkouda.int64(value)

Bases: numpy.signedinteger

Signed integer type, compatible with Python int and C long.

Character code:

'l'

Canonical name:

numpy.int_

Alias on this platform (Linux x86_64):

numpy.int64: 64-bit signed integer (-9_223_372_036_854_775_808 to 9_223_372_036_854_775_807).

Alias on this platform (Linux x86_64):

numpy.intp: Signed integer large enough to fit pointer, compatible with C intptr_t.

bit_count(*args, **kwargs)

int64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int64(127).bit_count()
7
>>> np.int64(-127).bit_count()
7
class arkouda.int64(value)

Bases: numpy.signedinteger

Signed integer type, compatible with Python int and C long.

Character code:

'l'

Canonical name:

numpy.int_

Alias on this platform (Linux x86_64):

numpy.int64: 64-bit signed integer (-9_223_372_036_854_775_808 to 9_223_372_036_854_775_807).

Alias on this platform (Linux x86_64):

numpy.intp: Signed integer large enough to fit pointer, compatible with C intptr_t.

bit_count(*args, **kwargs)

int64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int64(127).bit_count()
7
>>> np.int64(-127).bit_count()
7
class arkouda.int8(value)

Bases: numpy.signedinteger

Signed integer type, compatible with C char.

Character code:

'b'

Canonical name:

numpy.byte

Alias on this platform (Linux x86_64):

numpy.int8: 8-bit signed integer (-128 to 127).

bit_count(*args, **kwargs)

int8.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int8(127).bit_count()
7
>>> np.int8(-127).bit_count()
7
class arkouda.intTypes

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.intTypes

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.intTypes

frozenset() -> empty frozenset object frozenset(iterable) -> frozenset object

Build an immutable unordered collection of unique elements.

copy(*args, **kwargs)

Return a shallow copy of a set.

difference(*args, **kwargs)

Return the difference of two or more sets as a new set.

(i.e. all elements that are in this set but not the others.)

intersection(*args, **kwargs)

Return the intersection of two sets as a new set.

(i.e. all elements that are in both sets.)

isdisjoint(*args, **kwargs)

Return True if two sets have a null intersection.

issubset(*args, **kwargs)

Report whether another set contains this set.

issuperset(*args, **kwargs)

Report whether this set contains another set.

symmetric_difference(*args, **kwargs)

Return the symmetric difference of two sets as a new set.

(i.e. all elements that are in exactly one of the sets.)

union(*args, **kwargs)

Return the union of sets as a new set.

(i.e. all elements that are in either set.)

class arkouda.int_(value)

Bases: numpy.signedinteger

Signed integer type, compatible with Python int and C long.

Character code:

'l'

Canonical name:

numpy.int_

Alias on this platform (Linux x86_64):

numpy.int64: 64-bit signed integer (-9_223_372_036_854_775_808 to 9_223_372_036_854_775_807).

Alias on this platform (Linux x86_64):

numpy.intp: Signed integer large enough to fit pointer, compatible with C intptr_t.

bit_count(*args, **kwargs)

int64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int64(127).bit_count()
7
>>> np.int64(-127).bit_count()
7
class arkouda.int_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

class arkouda.int_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

class arkouda.int_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

class arkouda.intc(value)

Bases: numpy.signedinteger

Signed integer type, compatible with C int.

Character code:

'i'

Canonical name:

numpy.intc

Alias on this platform (Linux x86_64):

numpy.int32: 32-bit signed integer (-2_147_483_648 to 2_147_483_647).

bit_count(*args, **kwargs)

int32.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int32(127).bit_count()
7
>>> np.int32(-127).bit_count()
7
class arkouda.integer(value)

Bases: numpy.number

Abstract base class of all integer scalar types.

denominator(*args, **kwargs)

denominator of value (1)

is_integer(*args, **kwargs)

integer.is_integer() -> bool

Return True if the number is finite with integral value.

Added in version 1.22.

>>> np.int64(-2).is_integer()
True
>>> np.uint32(5).is_integer()
True
numerator(*args, **kwargs)

numerator of value (the value itself)

arkouda.intersect(a, b, positions=True, unique=False)[source]

Find the intersection of two arkouda arrays.

This function can be especially useful when positions=True so that the caller gets the indices of values present in both arrays.

Parameters:
  • a (Strings or pdarray) – An array of strings.

  • b (Strings or pdarray) – An array of strings.

  • positions (bool, default=True) – Return tuple of boolean pdarrays that indicate positions in a and b of the intersection values.

  • unique (bool, default=False) – If the number of distinct values in a (and b) is equal to the size of a (and b), there is a more efficient method to compute the intersection.

Returns:

The indices of a and b where any element occurs at least once in both arrays.

Return type:

(arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray) or arkouda.pdarrayclass.pdarray

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.arange(10)
>>> print(a)
[0 1 2 3 4 5 6 7 8 9]
>>> b = 2 * ak.arange(10)
>>> print(b)
[0 2 4 6 8 10 12 14 16 18]
>>> intersect(a,b, positions=True)
(array([True False True False True False True False True False]),
array([True True True True True False False False False False]))
>>> intersect(a,b, positions=False)
array([0 2 4 6 8])
arkouda.intersect1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable[source]

Find the intersection of two arrays.

Return the sorted, unique values that are in both of the input arrays.

Parameters:
  • pda1 (pdarray/Sequence[pdarray, Strings, Categorical]) – Input array/Sequence of groupable objects

  • pda2 (pdarray/List) – Input array/sequence of groupable objects

  • assume_unique (bool) – If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.

Returns:

Sorted 1D array/List of sorted pdarrays of common and unique elements.

Return type:

pdarray/groupable

Raises:
  • TypeError – Raised if either pda1 or pda2 is not a pdarray

  • RuntimeError – Raised if the dtype of either pdarray is not supported

Notes

ak.intersect1d is not supported for bool or float64 pdarrays

Examples

>>>
# 1D Example
>>> ak.intersect1d([1, 3, 4, 3], [3, 1, 2, 1])
array([1, 3])
# Multi-Array Example
>>> a = ak.arange(5)
>>> b = ak.array([1, 5, 3, 4, 2])
>>> c = ak.array([1, 4, 3, 2, 5])
>>> d = ak.array([1, 2, 3, 5, 4])
>>> multia = [a, a, a]
>>> multib = [b, c, d]
>>> ak.intersect1d(multia, multib)
[array([1, 3]), array([1, 3]), array([1, 3])]
arkouda.interval_lookup(keys, values, arguments, fillvalue=-1, tiebreak=None, hierarchical=False)[source]

Apply a function defined over intervals to an array of arguments.

Parameters:
  • keys (2-tuple of (sequences of) pdarrays) – Tuple of closed intervals expressed as (lower_bounds_inclusive, upper_bounds_inclusive). Must have same dtype(s) as vals.

  • values (pdarray) – Function value to return for each entry in keys.

  • arguments ((sequences of) pdarray) – Values to search for in intervals. If multiple arrays, each “row” is an item.

  • fillvalue (scalar) – Default value to return when argument is not in any interval.

  • tiebreak ((optional) pdarray, numeric) – When an argument is present in more than one key interval, the interval with the lowest tiebreak value will be chosen. If no tiebreak is given, the first valid key interval will be chosen.

Returns:

Value of function corresponding to the keys interval containing each argument, or fillvalue if argument not in any interval.

Return type:

pdarray

class arkouda.intp(value)

Bases: numpy.signedinteger

Signed integer type, compatible with Python int and C long.

Character code:

'l'

Canonical name:

numpy.int_

Alias on this platform (Linux x86_64):

numpy.int64: 64-bit signed integer (-9_223_372_036_854_775_808 to 9_223_372_036_854_775_807).

Alias on this platform (Linux x86_64):

numpy.intp: Signed integer large enough to fit pointer, compatible with C intptr_t.

bit_count(*args, **kwargs)

int64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int64(127).bit_count()
7
>>> np.int64(-127).bit_count()
7
arkouda.intx(a, b)[source]

Find all the rows that are in both dataframes. Columns should be in identical order.

Note: does not work for columns of floating point values, but does work for Strings, pdarrays of int64 type, and Categorical should work.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.DataFrame({'a':ak.arange(5),'b': 2* ak.arange(5)})
>>> display(a)

a

b

0

0

0

1

1

2

2

2

4

3

3

6

4

4

8

>>> b = ak.DataFrame({'a':ak.arange(5),'b':ak.array([0,3,4,7,8])})
>>> display(b)

a

b

0

0

0

1

1

3

2

2

4

3

3

7

4

4

8

>>> intx(a,b)
>>> intersect_df = a[intx(a,b)]
>>> display(intersect_df)

a

b

0

0

0

1

2

4

2

4

8

arkouda.invert_permutation(perm)[source]

Find the inverse of a permutation array.

Parameters:

perm (pdarray) – The permutation array.

Returns:

The inverse of the permutation array.

Return type:

arkouda.pdarrayclass.pdarray

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.index import Index
>>> i = Index(ak.array([1,2,0,5,4]))
>>> perm = i.argsort()
>>> print(perm)
[2 0 1 4 3]
>>> invert_permutation(perm)
array([1 2 0 4 3])
arkouda.ip_address(values)[source]

Convert values to an Arkouda array of IP addresses.

Parameters:

values (list-like, integer pdarray, or IPv4) – The integer IP addresses or IPv4 object.

Returns:

The same IP addresses as an Arkouda array

Return type:

IPv4

Notes

This helper is intended to help future proof changes made to accomodate IPv6 and to prevent errors if a user inadvertently casts a IPv4 instead of a int64 pdarray. It can also be used for importing Python lists of IP addresses into Arkouda.

arkouda.isSupportedBool(num)[source]
arkouda.isSupportedDType(scalar)[source]
arkouda.isSupportedFloat(num)[source]
arkouda.isSupportedInt(num)[source]
arkouda.isSupportedInt(num)[source]
arkouda.isSupportedInt(num)[source]
arkouda.isSupportedInt(num)[source]
arkouda.isSupportedNumber(num)[source]
arkouda.is_cosorted(arrays)[source]

Return True iff the arrays are cosorted, i.e., if the arrays were columns in a table then the rows are sorted.

Parameters:

arrays (list-like of pdarrays) – Arrays to check for cosortedness

Returns:

True iff arrays are cosorted.

Return type:

bool

Raises:
  • ValueError – Raised if arrays are not the same length

  • TypeError – Raised if arrays is not a list-like of pdarrays

arkouda.is_ipv4(ip: arkouda.pdarrayclass.pdarray | IPv4, ip2: arkouda.pdarrayclass.pdarray | None = None) arkouda.pdarrayclass.pdarray[source]

Indicate which values are ipv4 when passed data containing IPv4 and IPv6 values.

Parameters:
  • ip (pdarray (int64) or ak.IPv4)

  • in. (IPv4 value. High Bits of IPv6 if IPv6 is passed)

  • ip2 (pdarray (int64), Optional)

  • well. (Low Bits of IPv6. This is added for support when dealing with data that contains IPv6 as)

Return type:

pdarray of bools indicating which indexes are IPv4.

See also

ak.is_ipv6

arkouda.is_ipv6(ip: arkouda.pdarrayclass.pdarray | IPv4, ip2: arkouda.pdarrayclass.pdarray | None = None) arkouda.pdarrayclass.pdarray[source]

Indicate which values are ipv6 when passed data containing IPv4 and IPv6 values.

Parameters:
Return type:

pdarray of bools indicating which indexes are IPv6.

See also

ak.is_ipv4

arkouda.is_registered(name: str, as_component: bool = False) bool[source]

Determine if the name provided is associated with a registered Object

Parameters:
  • name (str) – The name to check for in the registry

  • as_component (bool) – Default: False When True, the name will be checked to determine if it is registered as a component of a registered object

Return type:

bool

arkouda.isfinite(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise isfinite check applied to the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing boolean values indicating whether the input array elements are finite

Return type:

pdarray

Raises:
  • TypeError – Raised if the parameter is not a pdarray

  • RuntimeError – if the underlying pdarray is not float-based

Examples

>>> ak.isfinite(ak.array[1.0, 2.0, ak.inf])
array([True, True, False])
arkouda.isinf(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise isinf check applied to the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing boolean values indicating whether the input array elements are infinite

Return type:

pdarray

Raises:
  • TypeError – Raised if the parameter is not a pdarray

  • RuntimeError – if the underlying pdarray is not float-based

Examples

>>> ak.isinf(ak.array[1.0, 2.0, ak.inf])
array([False, False, True])
arkouda.isnan(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise isnan check applied to the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing boolean values indicating whether the input array elements are NaN

Return type:

pdarray

Raises:
  • TypeError – Raised if the parameter is not a pdarray

  • RuntimeError – if the underlying pdarray is not float-based

Examples

>>> ak.isnan(ak.array[1.0, 2.0, 1.0 / 0.0])
array([False, False, True])
arkouda.isnan(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise isnan check applied to the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing boolean values indicating whether the input array elements are NaN

Return type:

pdarray

Raises:
  • TypeError – Raised if the parameter is not a pdarray

  • RuntimeError – if the underlying pdarray is not float-based

Examples

>>> ak.isnan(ak.array[1.0, 2.0, 1.0 / 0.0])
array([False, False, True])
arkouda.isscalar(element)

Returns True if the type of element is a scalar type.

Parameters:

element (any) – Input argument, can be of any type and shape.

Returns:

val – True if element is a scalar type, False if it is not.

Return type:

bool

See also

ndim

Get the number of dimensions of an array

Notes

If you need a stricter way to identify a numerical scalar, use isinstance(x, numbers.Number), as that returns False for most non-numerical elements such as strings.

In most cases np.ndim(x) == 0 should be used instead of this function, as that will also return true for 0d arrays. This is how numpy overloads functions in the style of the dx arguments to gradient and the bins argument to histogram. Some key differences:

x

isscalar(x)

np.ndim(x) == 0

PEP 3141 numeric objects (including builtins)

True

True

builtin string and buffer objects

True

True

other builtin objects, like pathlib.Path, Exception, the result of re.compile

False

True

third-party objects like matplotlib.figure.Figure

False

True

zero-dimensional numpy arrays

False

True

other numpy arrays

False

False

list, tuple, and other sequence objects

False

False

Examples

>>> np.isscalar(3.1)
True
>>> np.isscalar(np.array(3.1))
False
>>> np.isscalar([3.1])
False
>>> np.isscalar(False)
True
>>> np.isscalar('numpy')
True

NumPy supports PEP 3141 numbers:

>>> from fractions import Fraction
>>> np.isscalar(Fraction(5, 17))
True
>>> from numbers import Number
>>> np.isscalar(Number())
True
arkouda.issctype(rep)

Determines whether the given object represents a scalar data-type.

Parameters:

rep (any) – If rep is an instance of a scalar dtype, True is returned. If not, False is returned.

Returns:

out – Boolean result of check whether rep is a scalar dtype.

Return type:

bool

See also

issubsctype, issubdtype, obj2sctype, sctype2char

Examples

>>> np.issctype(np.int32)
True
>>> np.issctype(list)
False
>>> np.issctype(1.1)
False

Strings are also a scalar type:

>>> np.issctype(np.dtype('str'))
True
arkouda.issubclass_(arg1, arg2)

Determine if a class is a subclass of a second class.

issubclass_ is equivalent to the Python built-in issubclass, except that it returns False instead of raising a TypeError if one of the arguments is not a class.

Parameters:
  • arg1 (class) – Input class. True is returned if arg1 is a subclass of arg2.

  • arg2 (class or tuple of classes.) – Input class. If a tuple of classes, True is returned if arg1 is a subclass of any of the tuple elements.

Returns:

out – Whether arg1 is a subclass of arg2 or not.

Return type:

bool

See also

issubsctype, issubdtype, issctype

Examples

>>> np.issubclass_(np.int32, int)
False
>>> np.issubclass_(np.int32, float)
False
>>> np.issubclass_(np.float64, float)
True
arkouda.issubdtype(arg1, arg2)

Returns True if first argument is a typecode lower/equal in type hierarchy.

This is like the builtin issubclass(), but for dtypes.

Parameters:
  • arg1 (dtype_like) – dtype or object coercible to one

  • arg2 (dtype_like) – dtype or object coercible to one

Returns:

out

Return type:

bool

See also

arrays.scalars

Overview of the numpy type hierarchy.

issubsctype, issubclass_

Examples

issubdtype can be used to check the type of arrays:

>>> ints = np.array([1, 2, 3], dtype=np.int32)
>>> np.issubdtype(ints.dtype, np.integer)
True
>>> np.issubdtype(ints.dtype, np.floating)
False
>>> floats = np.array([1, 2, 3], dtype=np.float32)
>>> np.issubdtype(floats.dtype, np.integer)
False
>>> np.issubdtype(floats.dtype, np.floating)
True

Similar types of different sizes are not subdtypes of each other:

>>> np.issubdtype(np.float64, np.float32)
False
>>> np.issubdtype(np.float32, np.float64)
False

but both are subtypes of floating:

>>> np.issubdtype(np.float64, np.floating)
True
>>> np.issubdtype(np.float32, np.floating)
True

For convenience, dtype-like objects are allowed too:

>>> np.issubdtype('S1', np.string_)
True
>>> np.issubdtype('i4', np.signedinteger)
True
arkouda.join_on_eq_with_dt(a1: arkouda.pdarrayclass.pdarray, a2: arkouda.pdarrayclass.pdarray, t1: arkouda.pdarrayclass.pdarray, t2: arkouda.pdarrayclass.pdarray, dt: int | numpy.int64, pred: str, result_limit: int | numpy.int64 = 1000) Tuple[arkouda.pdarrayclass.pdarray, arkouda.pdarrayclass.pdarray][source]

Performs an inner-join on equality between two integer arrays where the time-window predicate is also true

Parameters:
  • a1 (pdarray, int64) – pdarray to be joined

  • a2 (pdarray, int64) – pdarray to be joined

  • t1 (pdarray) – timestamps in millis corresponding to the a1 pdarray

  • t2 (pdarray) – timestamps in millis corresponding to the a2 pdarray

  • dt (Union[int,np.int64]) – time delta

  • pred (str) – time window predicate

  • result_limit (Union[int,np.int64]) – size limit for returned result

Returns:

  • result_array_one (pdarray, int64) – a1 indices where a1 == a2

  • result_array_one (pdarray, int64) – a2 indices where a2 == a1

Raises:
  • TypeError – Raised if a1, a2, t1, or t2 is not a pdarray, or if dt or result_limit is not an int

  • ValueError – if a1, a2, t1, or t2 dtype is not int64, pred is not ‘true_dt’, ‘abs_dt’, or ‘pos_dt’, or result_limit is < 0

arkouda.left_align(left, right)[source]

Map two arrays of sparse identifiers to the 0-up index set implied by the left array, discarding values from right that do not appear in left.

arkouda.list_registry(detailed: bool = False)[source]

Return a list containing the names of all registered objects

Parameters:

detailed (bool) – Default = False Return details of registry objects. Currently includes object type for any objects

Returns:

Dict containing keys “Components” and “Objects”.

Return type:

dict

Raises:

RuntimeError – Raised if there’s a server-side error thrown

arkouda.list_symbol_table() List[str][source]

Return a list containing the names of all objects in the symbol table

Parameters:

None

Returns:

List of all object names in the symbol table

Return type:

list

Raises:

RuntimeError – Raised if there’s a server-side error thrown

arkouda.load(path_prefix: str, file_format: str = 'INFER', dataset: str = 'array', calc_string_offsets: bool = False, column_delim: str = ',') Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index][source]

Load a pdarray previously saved with pdarray.save().

Parameters:
  • path_prefix (str) – Filename prefix used to save the original pdarray

  • file_format (str) – ‘INFER’, ‘HDF5’ or ‘Parquet’. Defaults to ‘INFER’. Used to indicate the file type being loaded. If INFER, this will be detected during processing

  • dataset (str) – Dataset name where the pdarray was saved, defaults to ‘array’

  • calc_string_offsets (bool) – If True the server will ignore Segmented Strings ‘offsets’ array and derive it from the null-byte terminators. Defaults to False currently

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

Returns:

Dictionary of {datsetName: Union[pdarray, Strings, SegArray, Categorical]} with the previously saved pdarrays, Strings, SegArrays, or Categoricals

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical]]

Raises:
  • TypeError – Raised if either path_prefix or dataset is not a str

  • ValueError – Raised if invalid file_format or if the dataset is not present in all hdf5 files or if the path_prefix does not correspond to files accessible to Arkouda

  • RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them

Notes

If you have a previously saved Parquet file that is raising a FileNotFound error, try loading it with a .parquet appended to the prefix_path. Parquet files were previously ALWAYS stored with a .parquet extension.

ak.load does not support loading a single file. For loading single HDF5 files without the _LOCALE#### suffix please use ak.read().

CSV files without the Arkouda Header are not supported.

Examples

>>> # Loading from file without extension
>>> obj = ak.load('path/prefix')
Loads the array from numLocales files with the name ``cwd/path/name_prefix_LOCALE####``.
The file type is inferred during processing.
>>> # Loading with an extension (HDF5)
>>> obj = ak.load('path/prefix.test')
Loads the object from numLocales files with the name ``cwd/path/name_prefix_LOCALE####.test`` where
#### is replaced by each locale numbers. Because filetype is inferred during processing,
the extension is not required to be a specific format.
arkouda.load_all(path_prefix: str, file_format: str = 'INFER', column_delim: str = ',', read_nested=True) Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical][source]

Load multiple pdarrays, Strings, SegArrays, or Categoricals previously saved with save_all().

Parameters:
  • path_prefix (str) – Filename prefix used to save the original pdarray

  • file_format (str) – ‘INFER’, ‘HDF5’, ‘Parquet’, or ‘CSV’. Defaults to ‘INFER’. Indicates the format being loaded. When ‘INFER’ the processing will detect the format Defaults to ‘INFER’

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Parquet files only

Returns:

Dictionary of {datsetName: Union[pdarray, Strings, SegArray, Categorical]} with the previously saved pdarrays, Strings, SegArrays, or Categoricals

Return type:

Mapping[str, Union[pdarray, Strings, SegArray, Categorical]]

Raises:
  • TypeError: – Raised if path_prefix is not a str

  • ValueError – Raised if file_format/extension is encountered that is not hdf5 or parquet or if all datasets are not present in all hdf5/parquet files or if the path_prefix does not correspond to files accessible to Arkouda

  • RuntimeError – Raised if the hdf5 files are present but there is an error in opening one or more of them

See also

to_parquet, to_hdf, load, read

Notes

This function has been updated to determine the file extension based on the file format variable

This function will be deprecated when glob flags are added to read_* methods

CSV files without the Arkouda Header are not supported.

arkouda.log(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise natural log of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing natural log values of the input array elements

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Notes

Logarithms with other bases can be computed as follows:

Examples

>>> A = ak.array([1, 10, 100])
# Natural log
>>> ak.log(A)
array([0, 2.3025850929940459, 4.6051701859880918])
# Log base 10
>>> ak.log(A) / np.log(10)
array([0, 1, 2])
# Log base 2
>>> ak.log(A) / np.log(2)
array([0, 3.3219280948873626, 6.6438561897747253])
arkouda.log10(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise base 10 log of the array.

Parameters:

pda (pdarray) – array to compute on

Return type:

pdarray contain values of the base 10 log

arkouda.log1p(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise natural log of one plus the array.

Parameters:

pda (pdarray) – array to compute on

Return type:

pdarray contain values of the natural log of one plus the array

arkouda.log2(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise base 2 log of the array.

Parameters:

pda (pdarray) – array to compute on

Return type:

pdarray contain values of the base 2 log

class arkouda.longdouble(value)

Bases: numpy.floating

Extended-precision floating-point number type, compatible with C

long double but not necessarily with IEEE 754 quadruple-precision.

Character code:

'g'

Alias:

numpy.longfloat

Alias on this platform (Linux x86_64):

numpy.float128: 128-bit extended-precision floating-point number type.

as_integer_ratio(*args, **kwargs)

longdouble.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.longdouble(10.0).as_integer_ratio()
(10, 1)
>>> np.longdouble(0.0).as_integer_ratio()
(0, 1)
>>> np.longdouble(-.25).as_integer_ratio()
(-1, 4)
is_integer(*args, **kwargs)

longdouble.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.longdouble(-2.0).is_integer()
True
>>> np.longdouble(3.2).is_integer()
False
class arkouda.longfloat(value)

Bases: numpy.floating

Extended-precision floating-point number type, compatible with C

long double but not necessarily with IEEE 754 quadruple-precision.

Character code:

'g'

Alias:

numpy.longfloat

Alias on this platform (Linux x86_64):

numpy.float128: 128-bit extended-precision floating-point number type.

as_integer_ratio(*args, **kwargs)

longdouble.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.longdouble(10.0).as_integer_ratio()
(10, 1)
>>> np.longdouble(0.0).as_integer_ratio()
(0, 1)
>>> np.longdouble(-.25).as_integer_ratio()
(-1, 4)
is_integer(*args, **kwargs)

longdouble.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.longdouble(-2.0).is_integer()
True
>>> np.longdouble(3.2).is_integer()
False
class arkouda.longlong(value)

Bases: numpy.signedinteger

Signed integer type, compatible with C long long.

Character code:

'q'

bit_count(*args, **kwargs)
arkouda.lookup(keys, values, arguments, fillvalue=-1)[source]

Apply the function defined by the mapping keys –> values to arguments.

Parameters:
  • keys ((sequence of) array-like) – The domain of the function. Entries must be unique (if a sequence of arrays is given, each row is treated as a tuple-valued entry).

  • values (pdarray) – The range of the function. Must be same length as keys.

  • arguments ((sequence of) array-like) – The arguments on which to evaluate the function. Must have same dtype (or tuple of dtypes, for a sequence) as keys.

  • fillvalue (scalar) – The default value to return for arguments not in keys.

Returns:

evaluated – The result of evaluating the function over arguments.

Return type:

pdarray

Notes

While the values cannot be Strings (or other complex objects), the same result can be achieved by passing an arange as the values, then using the return as indices into the desired object.

Examples

# Lookup numbers by two-word name >>> keys1 = ak.array([‘twenty’ for _ in range(5)]) >>> keys2 = ak.array([‘one’, ‘two’, ‘three’, ‘four’, ‘five’]) >>> values = ak.array([21, 22, 23, 24, 25]) >>> args1 = ak.array([‘twenty’, ‘thirty’, ‘twenty’]) >>> args2 = ak.array([‘four’, ‘two’, ‘two’]) >>> aku.lookup([keys1, keys2], values, [args1, args2]) array([24, -1, 22])

# Other direction requires an intermediate index >>> revkeys = values >>> revindices = ak.arange(values.size) >>> revargs = ak.array([24, 21, 22]) >>> idx = aku.lookup(revkeys, revindices, revargs) >>> keys1[idx], keys2[idx] (array([‘twenty’, ‘twenty’, ‘twenty’]), array([‘four’, ‘one’, ‘two’]))

arkouda.ls(filename: str, col_delim: str = ',', read_nested: bool = True) List[str][source]

This function calls the h5ls utility on a HDF5 file visible to the arkouda server or calls a function that imitates the result of h5ls on a Parquet file.

Parameters:
  • filename (str) – The name of the file to pass to the server

  • col_delim (str) – The delimiter used to separate columns if the file is a csv

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Only used for Parquet files.

Returns:

The string output of the datasets from the server

Return type:

str

Raises:
  • TypeError – Raised if filename is not a str

  • ValueError – Raised if filename is empty or contains only whitespace

  • RuntimeError – Raised if error occurs in executing ls on an HDF5 file

  • Notes

    • This will need to be updated because Parquet will not technically support this when we update.

      Similar functionality will be added for Parquet in the future

    • For CSV files without headers, please use ls_csv

See also

ls_csv

arkouda.ls_csv(filename: str, col_delim: str = ',') List[str][source]

Used for identifying the datasets within a file when a CSV does not have a header.

Parameters:
  • filename (str) – The name of the file to pass to the server

  • col_delim (str) – The delimiter used to separate columns if the file is a csv

Returns:

The string output of the datasets from the server

Return type:

str

See also

ls

arkouda.matmul(pdaLeft: arkouda.pdarrayclass.pdarray, pdaRight: arkouda.pdarrayclass.pdarray)[source]

Compute the product of two matrices.

Parameters:
Returns:

the matrix product pdaLeft x pdaRight

Return type:

pdarray

Examples

>>> a = ak.array([[1,2,3,4,5],[1,2,3,4,5]])
>>> b = ak.array(([1,1],[2,2],[3,3],[4,4],[5,5]])
>>> ak.matmul(a,b)
array([array([30 30]) array([45 45])])
>>> x = ak.array([[1,2,3],[1.1,2.1,3.1]])
>>> y = ak.array([[1,1,1],[0,2,2],[0,0,3]])
>>> ak.matmul(x,y)
array([array([1.00000000000000000 5.00000000000000000 14.00000000000000000])
array([1.1000000000000001 5.3000000000000007 14.600000000000001])])

Notes

Server returns an error if shapes of pdaLeft and pdaRight are incompatible with matrix multiplication.

arkouda.maximum_sctype(t)

Return the scalar type of highest precision of the same kind as the input.

Parameters:

t (dtype or dtype specifier) – The input data type. This can be a dtype object or an object that is convertible to a dtype.

Returns:

out – The highest precision data type of the same kind (dtype.kind) as t.

Return type:

dtype

See also

obj2sctype, mintypecode, sctype2char, dtype

Examples

>>> np.maximum_sctype(int)
<class 'numpy.int64'>
>>> np.maximum_sctype(np.uint8)
<class 'numpy.uint64'>
>>> np.maximum_sctype(complex)
<class 'numpy.complex256'> # may vary
>>> np.maximum_sctype(str)
<class 'numpy.str_'>
>>> np.maximum_sctype('i2')
<class 'numpy.int64'>
>>> np.maximum_sctype('f4')
<class 'numpy.float128'> # may vary
arkouda.maxk(pda: pdarray, k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Find the k maximum values of an array.

Returns the largest k values of an array, sorted

Parameters:
  • pda (pdarray) – Input array.

  • k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda, sorted

Return type:

pdarray, int

Raises:
  • TypeError – Raised if pda is not a pdarray or k is not an integer

  • ValueError – Raised if the pda is empty or k < 1

Notes

This call is equivalent in value to:

a[ak.argsort(a)[k:]]

and generally outperforms this operation.

This reduction will see a significant drop in performance as k grows beyond a certain value. This value is system dependent, but generally about a k of 5 million is where performance degredation has been observed.

Examples

>>> A = ak.array([10,5,1,3,7,2,9,0])
>>> ak.maxk(A, 3)
array([7, 9, 10])
>>> ak.maxk(A, 4)
array([5, 7, 9, 10])
arkouda.mean(pda: pdarray) numpy.float64[source]

Return the mean of the array.

Parameters:

pda (pdarray) – Values for which to calculate the mean

Returns:

The mean calculated from the pda sum and size

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

arkouda.median(pda)[source]

Compute the median of a given array. 1d case only, for now.

Parameters:

pda (pdarray) – The input data, in pdarray form, numeric type or boolean

Returns:

The median of the entire pdarray The array is sorted, and then if the number of elements is odd,

the return value is the middle element. If even, then the mean of the two middle elements.

Return type:

np.float64

Examples

>>> import arkouda as ak
>>> arkouda.connect()
>>> pda = ak.array ([0,4,7,8,1,3,5,2,-1])
>>> ak.median(pda)
3
>>> pda = ak.array([0,1,3,3,1,2,3,4,2,3])
2.5
arkouda.merge(left: DataFrame, right: DataFrame, on: str | List[str] | None = None, how: str = 'inner', left_suffix: str = '_x', right_suffix: str = '_y', convert_ints: bool = True, sort: bool = True) DataFrame[source]

Merge Arkouda DataFrames with a database-style join. The resulting dataframe contains rows from both DataFrames as specified by the merge condition (based on the “how” and “on” parameters).

Based on pandas merge functionality. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Parameters:
  • left (DataFrame) – The Left DataFrame to be joined.

  • right (DataFrame) – The Right DataFrame to be joined.

  • on (Optional[Union[str, List[str]]] = None) – The name or list of names of the DataFrame column(s) to join on. If on is None, this defaults to the intersection of the columns in both DataFrames.

  • how (str, default = "inner") – The merge condition. Must be one of “inner”, “left”, “right”, or “outer”.

  • left_suffix (str, default = "_x") – A string indicating the suffix to add to columns from the left dataframe for overlapping column names in both left and right. Defaults to “_x”. Only used when how is “inner”.

  • right_suffix (str, default = "_y") – A string indicating the suffix to add to columns from the right dataframe for overlapping column names in both left and right. Defaults to “_y”. Only used when how is “inner”.

  • convert_ints (bool = True) – If True, convert columns with missing int values (due to the join) to float64. This is to match pandas. If False, do not convert the column dtypes. This has no effect when how = “inner”.

  • sort (bool = True) – If True, DataFrame is returned sorted by “on”. Otherwise, the DataFrame is not sorted.

Returns:

Joined Arkouda DataFrame.

Return type:

arkouda.dataframe.DataFrame

Note

Multiple column joins are only supported for integer columns.

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda import merge
>>> left_df = ak.DataFrame({'col1': ak.arange(5), 'col2': -1 * ak.arange(5)})
>>> display(left_df)

col1

col2

0

0

0

1

1

-1

2

2

-2

3

3

-3

4

4

-4

>>> right_df = ak.DataFrame({'col1': 2 * ak.arange(5), 'col2': 2 * ak.arange(5)})
>>> display(right_df)

col1

col2

0

0

0

1

2

2

2

4

4

3

6

6

4

8

8

>>> merge(left_df, right_df, on = "col1")

col1

col2_x

col2_y

0

0

0

0

1

2

-2

2

2

4

-4

4

>>> merge(left_df, right_df, on = "col1", how = "left")

col1

col2_y

col2_x

0

0

0

0

1

1

nan

-1

2

2

2

-2

3

3

nan

-3

4

4

4

-4

>>> merge(left_df, right_df, on = "col1", how = "right")

col1

col2_x

col2_y

0

0

0

0

1

2

-2

2

2

4

-4

4

3

6

nan

6

4

8

nan

8

>>> merge(left_df, right_df, on = "col1", how = "outer")

col1

col2_y

col2_x

0

0

0

0

1

1

nan

-1

2

2

2

-2

3

3

nan

-3

4

4

4

-4

5

6

6

nan

6

8

8

nan

arkouda.mink(pda: pdarray, k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Find the k minimum values of an array.

Returns the smallest k values of an array, sorted

Parameters:
  • pda (pdarray) – Input array.

  • k (int_scalars) – The desired count of minimum values to be returned by the output.

Returns:

The minimum k values from pda, sorted

Return type:

pdarray

Raises:
  • TypeError – Raised if pda is not a pdarray

  • ValueError – Raised if the pda is empty or k < 1

Notes

This call is equivalent in value to:

a[ak.argsort(a)[:k]]

and generally outperforms this operation.

This reduction will see a significant drop in performance as k grows beyond a certain value. This value is system dependent, but generally about a k of 5 million is where performance degredation has been observed.

Examples

>>> A = ak.array([10,5,1,3,7,2,9,0])
>>> ak.mink(A, 3)
array([0, 1, 2])
>>> ak.mink(A, 4)
array([0, 1, 2, 3])
arkouda.mod(dividend, divisor) pdarray[source]

Returns the element-wise remainder of division.

Computes the remainder complementary to the floor_divide function. It is equivalent to np.mod, the remainder has the same sign as the divisor.

Parameters:
  • dividend – The array being acted on by the bases for the modular division.

  • divisor – The array that will be the bases for the modular division.

Returns:

Returns an array that contains the element-wise remainder of division.

Return type:

pdarray

arkouda.nan: float
class arkouda.number(value)

Bases: numpy.generic

Abstract base class of all numeric scalar types.

class arkouda.numeric_and_bool_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

class arkouda.numeric_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

class arkouda.numpy_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

class arkouda.object_(value)

Bases: numpy.generic

Any Python object.

Character code:

'O'

arkouda.parity(pda: pdarray) pdarray[source]

Find the bit parity (XOR of all bits) for each integer in an array.

Parameters:

pda (pdarray, int64, uint64, bigint) – Input array (must be integral).

Returns:

parity – The parity of each element: 0 if even number of bits set, 1 if odd.

Return type:

pdarray

Raises:

TypeError – If input array is not int64, uint64, or bigint

Examples

>>> A = ak.arange(10)
>>> ak.parity(A)
array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])
class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]

The basic arkouda array class. This class contains only the attributies of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.

name

The server-side identifier for the array

Type:

str

dtype

The element type of the array

Type:

dtype

size

The number of elements in the array

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

A list or tuple containing the sizes of each dimension of the array

Type:

Sequence[int]

itemsize

The size in bytes of each element

Type:

int_scalars

BinOps
OpEqOps
all(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff all elements of the array evaluate to True.

any(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff any element of the array evaluates to True.

argmax(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array max value.

argmaxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Finds the indices corresponding to the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values, sorted

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

argmin(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array min value

argmink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

astype(dtype) pdarray[source]

Cast values of pdarray to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) pdarray[source]

class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which array was registered under

Returns:

pdarray which is bound to the corresponding server side component which was registered with user_defined_name

Return type:

pdarray

Raises:

TypeError – Raised if user_defined_name is not a str

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
bigint_to_uint_arrays() List[pdarray][source]

Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Returns:

A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Return type:

List[pdarrays]

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618"
"18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
clz() pdarray[source]

Count the number of leading zeros in each element. See ak.clz.

corr(y: pdarray) numpy.float64[source]

Compute the correlation between self and y using pearson correlation coefficient.

Parameters:

y (pdarray) – Other pdarray used to calculate correlation

Returns:

The scalar correlation of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

cov(y: pdarray) numpy.float64[source]

Compute the covariance between self and y.

Parameters:

y (pdarray) – Other pdarray used to calculate covariance

Returns:

The scalar covariance of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

ctz() pdarray[source]

Count the number of trailing zeros in each element. See ak.ctz.

dtype
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether pdarrays are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the pdarrays are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> a_cpy = ak.array([1, 2, 3])
>>> a.equals(a_cpy)
True
>>> a2 = ak.array([1, 2, 5)
>>> a.equals(a2)
False
fill(value: arkouda.numpy.dtypes.numeric_scalars) None[source]

Fill the array (in place) with a constant value.

Parameters:

value (numeric_scalars)

Raises:

TypeError – Raised if value is not an int, int64, float, or float64

flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

format_other(other) str[source]

Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.

Parameters:

other (object) – The scalar to be cast to the pdarray.dtype

Return type:

string representation of np.dtype corresponding to the other parameter

Raises:

TypeError – Raised if the other parameter cannot be converted to Numpy dtype

property inferred_type: str | None

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

Note

This will return True if the object is registered itself or as a component of another object

is_sorted(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff the array is monotonically non-decreasing.

Parameters:

None

Returns:

Indicates if the array is monotonically non-decreasing

Return type:

bool

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

itemsize
max(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the maximum value of the array.

property max_bits
maxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

mean() numpy.float64[source]

Return the mean of the array.

min(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the minimum value of the array.

mink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

name
property nbytes

The size of the pdarray in bytes.

Returns:

The size of the pdarray in bytes.

Return type:

int

ndim
objType = 'pdarray'
opeq(other, op)[source]
parity() pdarray[source]

Find the parity (XOR of all bits) in each element. See ak.parity.

popcount() pdarray[source]

Find the population (number of bits set) in each element. See ak.popcount.

pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

prod(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the product of all elements in the array. Return value is always a np.float64 or np.int64.

register(user_defined_name: str) pdarray[source]

Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach() This is an in-place operation, registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.

Parameters:

user_defined_name (str) – user defined name array is to be registered under

Returns:

The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.

Return type:

pdarray

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
registered_name: str | None = None
reshape(*shape)[source]

Gives a new shape to an array without changing its data.

Parameters:

shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.

Returns:

a pdarray with the same data, reshaped to the new shape

Return type:

pdarray

rotl(other) pdarray[source]

Rotate bits left by <other>.

rotr(other) pdarray[source]

Rotate bits right by <other>.

save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:
  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append

  • TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string

Notes

The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
property shape

Return the shape of an array.

Returns:

The elements of the shape tuple give the lengths of the corresponding array dimensions.

Return type:

tuple of int

size
slice_bits(low, high) pdarray[source]

Returns a pdarray containing only bits from low to high of self.

This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)

Parameters:
  • low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0

  • high (int) – The highest bit included in the slice (inclusive)

Returns:

A new pdarray containing the bits of self from low to high

Return type:

pdarray

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the standard deviation. See arkouda.std for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

The scalar standard deviation of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

sum(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the sum of all elements in the array.

to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)[source]

Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

prefix_path: str

The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

dataset: str

Column name to save the pdarray under. Defaults to “array”.

col_delim: str

Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

overwrite: bool

Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

str reponse message

ValueError

Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

RuntimeError

Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

TypeError

Raised if we receive an unknown arkouda_type returned from the server

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (`

`) at this time.

to_cuda()[source]

Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.

Returns:

A Numba ndarray with the same attributes and data as the pdarray; on GPU

Return type:

numba.DeviceNDArray

Raises:
  • ImportError – Raised if CUDA is not available

  • ModuleNotFoundError – Raised if Numba is either not installed or not enabled

  • RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the pdarray to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array in to single hdf5 file on the root node.
``cwd/path/name_prefix.hdf5``
to_list() List[source]

Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A list with the same data as the pdarray

Return type:

list

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A numpy ndarray with the same attributes and data as the pdarray

Return type:

np.ndarray

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_parqet('path/prefix.parquet', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a pdarray to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a pdarray in the arkouda server which was previously registered using register() and/or attahced to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

value_counts()[source]

Count the occurrences of the unique values of self.

Returns:

  • unique_values (pdarray) – The unique values, sorted in ascending order

  • counts (pdarray, int64) – The number of times the corresponding unique value occurs

Examples

>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
var(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the variance. See arkouda.var for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

The scalar variance of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • ValueError – Raised if the ddof >= pdarray size

  • RuntimeError – Raised if there’s a server-side error thrown

class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]

The basic arkouda array class. This class contains only the attributies of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.

name

The server-side identifier for the array

Type:

str

dtype

The element type of the array

Type:

dtype

size

The number of elements in the array

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

A list or tuple containing the sizes of each dimension of the array

Type:

Sequence[int]

itemsize

The size in bytes of each element

Type:

int_scalars

BinOps
OpEqOps
all(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff all elements of the array evaluate to True.

any(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff any element of the array evaluates to True.

argmax(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array max value.

argmaxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Finds the indices corresponding to the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values, sorted

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

argmin(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array min value

argmink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

astype(dtype) pdarray[source]

Cast values of pdarray to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) pdarray[source]

class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which array was registered under

Returns:

pdarray which is bound to the corresponding server side component which was registered with user_defined_name

Return type:

pdarray

Raises:

TypeError – Raised if user_defined_name is not a str

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
bigint_to_uint_arrays() List[pdarray][source]

Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Returns:

A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Return type:

List[pdarrays]

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618"
"18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
clz() pdarray[source]

Count the number of leading zeros in each element. See ak.clz.

corr(y: pdarray) numpy.float64[source]

Compute the correlation between self and y using pearson correlation coefficient.

Parameters:

y (pdarray) – Other pdarray used to calculate correlation

Returns:

The scalar correlation of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

cov(y: pdarray) numpy.float64[source]

Compute the covariance between self and y.

Parameters:

y (pdarray) – Other pdarray used to calculate covariance

Returns:

The scalar covariance of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

ctz() pdarray[source]

Count the number of trailing zeros in each element. See ak.ctz.

dtype
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether pdarrays are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the pdarrays are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> a_cpy = ak.array([1, 2, 3])
>>> a.equals(a_cpy)
True
>>> a2 = ak.array([1, 2, 5)
>>> a.equals(a2)
False
fill(value: arkouda.numpy.dtypes.numeric_scalars) None[source]

Fill the array (in place) with a constant value.

Parameters:

value (numeric_scalars)

Raises:

TypeError – Raised if value is not an int, int64, float, or float64

flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

format_other(other) str[source]

Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.

Parameters:

other (object) – The scalar to be cast to the pdarray.dtype

Return type:

string representation of np.dtype corresponding to the other parameter

Raises:

TypeError – Raised if the other parameter cannot be converted to Numpy dtype

property inferred_type: str | None

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

Note

This will return True if the object is registered itself or as a component of another object

is_sorted(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff the array is monotonically non-decreasing.

Parameters:

None

Returns:

Indicates if the array is monotonically non-decreasing

Return type:

bool

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

itemsize
max(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the maximum value of the array.

property max_bits
maxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

mean() numpy.float64[source]

Return the mean of the array.

min(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the minimum value of the array.

mink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

name
property nbytes

The size of the pdarray in bytes.

Returns:

The size of the pdarray in bytes.

Return type:

int

ndim
objType = 'pdarray'
opeq(other, op)[source]
parity() pdarray[source]

Find the parity (XOR of all bits) in each element. See ak.parity.

popcount() pdarray[source]

Find the population (number of bits set) in each element. See ak.popcount.

pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

prod(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the product of all elements in the array. Return value is always a np.float64 or np.int64.

register(user_defined_name: str) pdarray[source]

Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach() This is an in-place operation, registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.

Parameters:

user_defined_name (str) – user defined name array is to be registered under

Returns:

The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.

Return type:

pdarray

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
registered_name: str | None = None
reshape(*shape)[source]

Gives a new shape to an array without changing its data.

Parameters:

shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.

Returns:

a pdarray with the same data, reshaped to the new shape

Return type:

pdarray

rotl(other) pdarray[source]

Rotate bits left by <other>.

rotr(other) pdarray[source]

Rotate bits right by <other>.

save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:
  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append

  • TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string

Notes

The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
property shape

Return the shape of an array.

Returns:

The elements of the shape tuple give the lengths of the corresponding array dimensions.

Return type:

tuple of int

size
slice_bits(low, high) pdarray[source]

Returns a pdarray containing only bits from low to high of self.

This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)

Parameters:
  • low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0

  • high (int) – The highest bit included in the slice (inclusive)

Returns:

A new pdarray containing the bits of self from low to high

Return type:

pdarray

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the standard deviation. See arkouda.std for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

The scalar standard deviation of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

sum(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the sum of all elements in the array.

to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)[source]

Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

prefix_path: str

The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

dataset: str

Column name to save the pdarray under. Defaults to “array”.

col_delim: str

Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

overwrite: bool

Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

str reponse message

ValueError

Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

RuntimeError

Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

TypeError

Raised if we receive an unknown arkouda_type returned from the server

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (`

`) at this time.

to_cuda()[source]

Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.

Returns:

A Numba ndarray with the same attributes and data as the pdarray; on GPU

Return type:

numba.DeviceNDArray

Raises:
  • ImportError – Raised if CUDA is not available

  • ModuleNotFoundError – Raised if Numba is either not installed or not enabled

  • RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the pdarray to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array in to single hdf5 file on the root node.
``cwd/path/name_prefix.hdf5``
to_list() List[source]

Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A list with the same data as the pdarray

Return type:

list

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A numpy ndarray with the same attributes and data as the pdarray

Return type:

np.ndarray

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_parqet('path/prefix.parquet', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a pdarray to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a pdarray in the arkouda server which was previously registered using register() and/or attahced to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

value_counts()[source]

Count the occurrences of the unique values of self.

Returns:

  • unique_values (pdarray) – The unique values, sorted in ascending order

  • counts (pdarray, int64) – The number of times the corresponding unique value occurs

Examples

>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
var(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the variance. See arkouda.var for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

The scalar variance of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • ValueError – Raised if the ddof >= pdarray size

  • RuntimeError – Raised if there’s a server-side error thrown

class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]

The basic arkouda array class. This class contains only the attributies of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.

name

The server-side identifier for the array

Type:

str

dtype

The element type of the array

Type:

dtype

size

The number of elements in the array

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

A list or tuple containing the sizes of each dimension of the array

Type:

Sequence[int]

itemsize

The size in bytes of each element

Type:

int_scalars

BinOps
OpEqOps
all(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff all elements of the array evaluate to True.

any(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff any element of the array evaluates to True.

argmax(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array max value.

argmaxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Finds the indices corresponding to the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values, sorted

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

argmin(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array min value

argmink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

astype(dtype) pdarray[source]

Cast values of pdarray to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) pdarray[source]

class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which array was registered under

Returns:

pdarray which is bound to the corresponding server side component which was registered with user_defined_name

Return type:

pdarray

Raises:

TypeError – Raised if user_defined_name is not a str

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
bigint_to_uint_arrays() List[pdarray][source]

Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Returns:

A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Return type:

List[pdarrays]

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618"
"18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
clz() pdarray[source]

Count the number of leading zeros in each element. See ak.clz.

corr(y: pdarray) numpy.float64[source]

Compute the correlation between self and y using pearson correlation coefficient.

Parameters:

y (pdarray) – Other pdarray used to calculate correlation

Returns:

The scalar correlation of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

cov(y: pdarray) numpy.float64[source]

Compute the covariance between self and y.

Parameters:

y (pdarray) – Other pdarray used to calculate covariance

Returns:

The scalar covariance of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

ctz() pdarray[source]

Count the number of trailing zeros in each element. See ak.ctz.

dtype
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether pdarrays are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the pdarrays are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> a_cpy = ak.array([1, 2, 3])
>>> a.equals(a_cpy)
True
>>> a2 = ak.array([1, 2, 5)
>>> a.equals(a2)
False
fill(value: arkouda.numpy.dtypes.numeric_scalars) None[source]

Fill the array (in place) with a constant value.

Parameters:

value (numeric_scalars)

Raises:

TypeError – Raised if value is not an int, int64, float, or float64

flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

format_other(other) str[source]

Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.

Parameters:

other (object) – The scalar to be cast to the pdarray.dtype

Return type:

string representation of np.dtype corresponding to the other parameter

Raises:

TypeError – Raised if the other parameter cannot be converted to Numpy dtype

property inferred_type: str | None

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

Note

This will return True if the object is registered itself or as a component of another object

is_sorted(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff the array is monotonically non-decreasing.

Parameters:

None

Returns:

Indicates if the array is monotonically non-decreasing

Return type:

bool

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

itemsize
max(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the maximum value of the array.

property max_bits
maxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

mean() numpy.float64[source]

Return the mean of the array.

min(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the minimum value of the array.

mink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

name
property nbytes

The size of the pdarray in bytes.

Returns:

The size of the pdarray in bytes.

Return type:

int

ndim
objType = 'pdarray'
opeq(other, op)[source]
parity() pdarray[source]

Find the parity (XOR of all bits) in each element. See ak.parity.

popcount() pdarray[source]

Find the population (number of bits set) in each element. See ak.popcount.

pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

prod(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the product of all elements in the array. Return value is always a np.float64 or np.int64.

register(user_defined_name: str) pdarray[source]

Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach() This is an in-place operation, registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.

Parameters:

user_defined_name (str) – user defined name array is to be registered under

Returns:

The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.

Return type:

pdarray

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
registered_name: str | None = None
reshape(*shape)[source]

Gives a new shape to an array without changing its data.

Parameters:

shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.

Returns:

a pdarray with the same data, reshaped to the new shape

Return type:

pdarray

rotl(other) pdarray[source]

Rotate bits left by <other>.

rotr(other) pdarray[source]

Rotate bits right by <other>.

save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:
  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append

  • TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string

Notes

The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
property shape

Return the shape of an array.

Returns:

The elements of the shape tuple give the lengths of the corresponding array dimensions.

Return type:

tuple of int

size
slice_bits(low, high) pdarray[source]

Returns a pdarray containing only bits from low to high of self.

This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)

Parameters:
  • low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0

  • high (int) – The highest bit included in the slice (inclusive)

Returns:

A new pdarray containing the bits of self from low to high

Return type:

pdarray

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the standard deviation. See arkouda.std for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

The scalar standard deviation of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

sum(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the sum of all elements in the array.

to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)[source]

Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

prefix_path: str

The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

dataset: str

Column name to save the pdarray under. Defaults to “array”.

col_delim: str

Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

overwrite: bool

Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

str reponse message

ValueError

Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

RuntimeError

Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

TypeError

Raised if we receive an unknown arkouda_type returned from the server

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (`

`) at this time.

to_cuda()[source]

Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.

Returns:

A Numba ndarray with the same attributes and data as the pdarray; on GPU

Return type:

numba.DeviceNDArray

Raises:
  • ImportError – Raised if CUDA is not available

  • ModuleNotFoundError – Raised if Numba is either not installed or not enabled

  • RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the pdarray to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array in to single hdf5 file on the root node.
``cwd/path/name_prefix.hdf5``
to_list() List[source]

Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A list with the same data as the pdarray

Return type:

list

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A numpy ndarray with the same attributes and data as the pdarray

Return type:

np.ndarray

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_parqet('path/prefix.parquet', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a pdarray to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a pdarray in the arkouda server which was previously registered using register() and/or attahced to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

value_counts()[source]

Count the occurrences of the unique values of self.

Returns:

  • unique_values (pdarray) – The unique values, sorted in ascending order

  • counts (pdarray, int64) – The number of times the corresponding unique value occurs

Examples

>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
var(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the variance. See arkouda.var for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

The scalar variance of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • ValueError – Raised if the ddof >= pdarray size

  • RuntimeError – Raised if there’s a server-side error thrown

class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]

The basic arkouda array class. This class contains only the attributies of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.

name

The server-side identifier for the array

Type:

str

dtype

The element type of the array

Type:

dtype

size

The number of elements in the array

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

A list or tuple containing the sizes of each dimension of the array

Type:

Sequence[int]

itemsize

The size in bytes of each element

Type:

int_scalars

BinOps
OpEqOps
all(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff all elements of the array evaluate to True.

any(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff any element of the array evaluates to True.

argmax(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array max value.

argmaxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Finds the indices corresponding to the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values, sorted

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

argmin(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array min value

argmink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

astype(dtype) pdarray[source]

Cast values of pdarray to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) pdarray[source]

class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which array was registered under

Returns:

pdarray which is bound to the corresponding server side component which was registered with user_defined_name

Return type:

pdarray

Raises:

TypeError – Raised if user_defined_name is not a str

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
bigint_to_uint_arrays() List[pdarray][source]

Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Returns:

A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Return type:

List[pdarrays]

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618"
"18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
clz() pdarray[source]

Count the number of leading zeros in each element. See ak.clz.

corr(y: pdarray) numpy.float64[source]

Compute the correlation between self and y using pearson correlation coefficient.

Parameters:

y (pdarray) – Other pdarray used to calculate correlation

Returns:

The scalar correlation of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

cov(y: pdarray) numpy.float64[source]

Compute the covariance between self and y.

Parameters:

y (pdarray) – Other pdarray used to calculate covariance

Returns:

The scalar covariance of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

ctz() pdarray[source]

Count the number of trailing zeros in each element. See ak.ctz.

dtype
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether pdarrays are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the pdarrays are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> a_cpy = ak.array([1, 2, 3])
>>> a.equals(a_cpy)
True
>>> a2 = ak.array([1, 2, 5)
>>> a.equals(a2)
False
fill(value: arkouda.numpy.dtypes.numeric_scalars) None[source]

Fill the array (in place) with a constant value.

Parameters:

value (numeric_scalars)

Raises:

TypeError – Raised if value is not an int, int64, float, or float64

flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

format_other(other) str[source]

Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.

Parameters:

other (object) – The scalar to be cast to the pdarray.dtype

Return type:

string representation of np.dtype corresponding to the other parameter

Raises:

TypeError – Raised if the other parameter cannot be converted to Numpy dtype

property inferred_type: str | None

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

Note

This will return True if the object is registered itself or as a component of another object

is_sorted(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff the array is monotonically non-decreasing.

Parameters:

None

Returns:

Indicates if the array is monotonically non-decreasing

Return type:

bool

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

itemsize
max(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the maximum value of the array.

property max_bits
maxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

mean() numpy.float64[source]

Return the mean of the array.

min(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the minimum value of the array.

mink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

name
property nbytes

The size of the pdarray in bytes.

Returns:

The size of the pdarray in bytes.

Return type:

int

ndim
objType = 'pdarray'
opeq(other, op)[source]
parity() pdarray[source]

Find the parity (XOR of all bits) in each element. See ak.parity.

popcount() pdarray[source]

Find the population (number of bits set) in each element. See ak.popcount.

pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

prod(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the product of all elements in the array. Return value is always a np.float64 or np.int64.

register(user_defined_name: str) pdarray[source]

Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach() This is an in-place operation, registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.

Parameters:

user_defined_name (str) – user defined name array is to be registered under

Returns:

The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.

Return type:

pdarray

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
registered_name: str | None = None
reshape(*shape)[source]

Gives a new shape to an array without changing its data.

Parameters:

shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.

Returns:

a pdarray with the same data, reshaped to the new shape

Return type:

pdarray

rotl(other) pdarray[source]

Rotate bits left by <other>.

rotr(other) pdarray[source]

Rotate bits right by <other>.

save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:
  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append

  • TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string

Notes

The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
property shape

Return the shape of an array.

Returns:

The elements of the shape tuple give the lengths of the corresponding array dimensions.

Return type:

tuple of int

size
slice_bits(low, high) pdarray[source]

Returns a pdarray containing only bits from low to high of self.

This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)

Parameters:
  • low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0

  • high (int) – The highest bit included in the slice (inclusive)

Returns:

A new pdarray containing the bits of self from low to high

Return type:

pdarray

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the standard deviation. See arkouda.std for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

The scalar standard deviation of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

sum(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the sum of all elements in the array.

to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)[source]

Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

prefix_path: str

The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

dataset: str

Column name to save the pdarray under. Defaults to “array”.

col_delim: str

Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

overwrite: bool

Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

str reponse message

ValueError

Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

RuntimeError

Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

TypeError

Raised if we receive an unknown arkouda_type returned from the server

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (`

`) at this time.

to_cuda()[source]

Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.

Returns:

A Numba ndarray with the same attributes and data as the pdarray; on GPU

Return type:

numba.DeviceNDArray

Raises:
  • ImportError – Raised if CUDA is not available

  • ModuleNotFoundError – Raised if Numba is either not installed or not enabled

  • RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the pdarray to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array in to single hdf5 file on the root node.
``cwd/path/name_prefix.hdf5``
to_list() List[source]

Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A list with the same data as the pdarray

Return type:

list

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A numpy ndarray with the same attributes and data as the pdarray

Return type:

np.ndarray

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_parqet('path/prefix.parquet', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a pdarray to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a pdarray in the arkouda server which was previously registered using register() and/or attahced to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

value_counts()[source]

Count the occurrences of the unique values of self.

Returns:

  • unique_values (pdarray) – The unique values, sorted in ascending order

  • counts (pdarray, int64) – The number of times the corresponding unique value occurs

Examples

>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
var(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the variance. See arkouda.var for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

The scalar variance of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • ValueError – Raised if the ddof >= pdarray size

  • RuntimeError – Raised if there’s a server-side error thrown

class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]

The basic arkouda array class. This class contains only the attributies of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.

name

The server-side identifier for the array

Type:

str

dtype

The element type of the array

Type:

dtype

size

The number of elements in the array

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

A list or tuple containing the sizes of each dimension of the array

Type:

Sequence[int]

itemsize

The size in bytes of each element

Type:

int_scalars

BinOps
OpEqOps
all(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff all elements of the array evaluate to True.

any(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff any element of the array evaluates to True.

argmax(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array max value.

argmaxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Finds the indices corresponding to the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values, sorted

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

argmin(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array min value

argmink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

astype(dtype) pdarray[source]

Cast values of pdarray to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) pdarray[source]

class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which array was registered under

Returns:

pdarray which is bound to the corresponding server side component which was registered with user_defined_name

Return type:

pdarray

Raises:

TypeError – Raised if user_defined_name is not a str

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
bigint_to_uint_arrays() List[pdarray][source]

Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Returns:

A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Return type:

List[pdarrays]

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618"
"18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
clz() pdarray[source]

Count the number of leading zeros in each element. See ak.clz.

corr(y: pdarray) numpy.float64[source]

Compute the correlation between self and y using pearson correlation coefficient.

Parameters:

y (pdarray) – Other pdarray used to calculate correlation

Returns:

The scalar correlation of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

cov(y: pdarray) numpy.float64[source]

Compute the covariance between self and y.

Parameters:

y (pdarray) – Other pdarray used to calculate covariance

Returns:

The scalar covariance of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

ctz() pdarray[source]

Count the number of trailing zeros in each element. See ak.ctz.

dtype
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether pdarrays are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the pdarrays are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> a_cpy = ak.array([1, 2, 3])
>>> a.equals(a_cpy)
True
>>> a2 = ak.array([1, 2, 5)
>>> a.equals(a2)
False
fill(value: arkouda.numpy.dtypes.numeric_scalars) None[source]

Fill the array (in place) with a constant value.

Parameters:

value (numeric_scalars)

Raises:

TypeError – Raised if value is not an int, int64, float, or float64

flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

format_other(other) str[source]

Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.

Parameters:

other (object) – The scalar to be cast to the pdarray.dtype

Return type:

string representation of np.dtype corresponding to the other parameter

Raises:

TypeError – Raised if the other parameter cannot be converted to Numpy dtype

property inferred_type: str | None

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

Note

This will return True if the object is registered itself or as a component of another object

is_sorted(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff the array is monotonically non-decreasing.

Parameters:

None

Returns:

Indicates if the array is monotonically non-decreasing

Return type:

bool

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

itemsize
max(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the maximum value of the array.

property max_bits
maxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

mean() numpy.float64[source]

Return the mean of the array.

min(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the minimum value of the array.

mink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

name
property nbytes

The size of the pdarray in bytes.

Returns:

The size of the pdarray in bytes.

Return type:

int

ndim
objType = 'pdarray'
opeq(other, op)[source]
parity() pdarray[source]

Find the parity (XOR of all bits) in each element. See ak.parity.

popcount() pdarray[source]

Find the population (number of bits set) in each element. See ak.popcount.

pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

prod(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the product of all elements in the array. Return value is always a np.float64 or np.int64.

register(user_defined_name: str) pdarray[source]

Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach() This is an in-place operation, registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.

Parameters:

user_defined_name (str) – user defined name array is to be registered under

Returns:

The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.

Return type:

pdarray

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
registered_name: str | None = None
reshape(*shape)[source]

Gives a new shape to an array without changing its data.

Parameters:

shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.

Returns:

a pdarray with the same data, reshaped to the new shape

Return type:

pdarray

rotl(other) pdarray[source]

Rotate bits left by <other>.

rotr(other) pdarray[source]

Rotate bits right by <other>.

save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:
  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append

  • TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string

Notes

The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
property shape

Return the shape of an array.

Returns:

The elements of the shape tuple give the lengths of the corresponding array dimensions.

Return type:

tuple of int

size
slice_bits(low, high) pdarray[source]

Returns a pdarray containing only bits from low to high of self.

This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)

Parameters:
  • low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0

  • high (int) – The highest bit included in the slice (inclusive)

Returns:

A new pdarray containing the bits of self from low to high

Return type:

pdarray

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the standard deviation. See arkouda.std for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

The scalar standard deviation of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

sum(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the sum of all elements in the array.

to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)[source]

Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

prefix_path: str

The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

dataset: str

Column name to save the pdarray under. Defaults to “array”.

col_delim: str

Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

overwrite: bool

Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

str reponse message

ValueError

Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

RuntimeError

Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

TypeError

Raised if we receive an unknown arkouda_type returned from the server

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (`

`) at this time.

to_cuda()[source]

Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.

Returns:

A Numba ndarray with the same attributes and data as the pdarray; on GPU

Return type:

numba.DeviceNDArray

Raises:
  • ImportError – Raised if CUDA is not available

  • ModuleNotFoundError – Raised if Numba is either not installed or not enabled

  • RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the pdarray to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array in to single hdf5 file on the root node.
``cwd/path/name_prefix.hdf5``
to_list() List[source]

Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A list with the same data as the pdarray

Return type:

list

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A numpy ndarray with the same attributes and data as the pdarray

Return type:

np.ndarray

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_parqet('path/prefix.parquet', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a pdarray to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a pdarray in the arkouda server which was previously registered using register() and/or attahced to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

value_counts()[source]

Count the occurrences of the unique values of self.

Returns:

  • unique_values (pdarray) – The unique values, sorted in ascending order

  • counts (pdarray, int64) – The number of times the corresponding unique value occurs

Examples

>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
var(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the variance. See arkouda.var for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

The scalar variance of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • ValueError – Raised if the ddof >= pdarray size

  • RuntimeError – Raised if there’s a server-side error thrown

class arkouda.pdarray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]

The basic arkouda array class. This class contains only the attributies of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a pdarray instance that points to the array data on the server. As such, the user should not initialize pdarray instances directly.

name

The server-side identifier for the array

Type:

str

dtype

The element type of the array

Type:

dtype

size

The number of elements in the array

Type:

int_scalars

ndim

The rank of the array (currently only rank 1 arrays supported)

Type:

int_scalars

shape

A list or tuple containing the sizes of each dimension of the array

Type:

Sequence[int]

itemsize

The size in bytes of each element

Type:

int_scalars

BinOps
OpEqOps
all(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff all elements of the array evaluate to True.

any(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff any element of the array evaluates to True.

argmax(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array max value.

argmaxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Finds the indices corresponding to the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values, sorted

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

argmin(axis: int | None | None = None, keepdims: bool = False) numpy.int64 | numpy.uint64 | pdarray[source]

Return the index of the first occurrence of the array min value

argmink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

Indices corresponding to the maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

astype(dtype) pdarray[source]

Cast values of pdarray to provided dtype

Parameters:

dtype (np.dtype or str) – Dtype to cast to

Returns:

An arkouda pdarray with values converted to the specified data type

Return type:

ak.pdarray

Notes

This is essentially shorthand for ak.cast(x, ‘<dtype>’) where x is a pdarray.

static attach(user_defined_name: str) pdarray[source]

class method to return a pdarray attached to the registered name in the arkouda server which was registered using register()

Parameters:

user_defined_name (str) – user defined name which array was registered under

Returns:

pdarray which is bound to the corresponding server side component which was registered with user_defined_name

Return type:

pdarray

Raises:

TypeError – Raised if user_defined_name is not a str

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
bigint_to_uint_arrays() List[pdarray][source]

Creates a list of uint pdarrays from a bigint pdarray. The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Returns:

A list of uint pdarrays where: The first item in return will be the highest 64 bits of the bigint pdarray and the last item will be the lowest 64 bits.

Return type:

List[pdarrays]

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> a = ak.arange(2**64, 2**64 + 5)
>>> a
array(["18446744073709551616" "18446744073709551617" "18446744073709551618"
"18446744073709551619" "18446744073709551620"])
>>> a.bigint_to_uint_arrays()
[array([1 1 1 1 1]), array([0 1 2 3 4])]
clz() pdarray[source]

Count the number of leading zeros in each element. See ak.clz.

corr(y: pdarray) numpy.float64[source]

Compute the correlation between self and y using pearson correlation coefficient.

Parameters:

y (pdarray) – Other pdarray used to calculate correlation

Returns:

The scalar correlation of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

cov(y: pdarray) numpy.float64[source]

Compute the covariance between self and y.

Parameters:

y (pdarray) – Other pdarray used to calculate covariance

Returns:

The scalar covariance of the two arrays

Return type:

np.float64

Raises:
  • TypeError – Raised if y is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

ctz() pdarray[source]

Count the number of trailing zeros in each element. See ak.ctz.

dtype
equals(other) arkouda.numpy.dtypes.bool_scalars[source]

Whether pdarrays are the same size and all entries are equal.

Parameters:

other (object) – object to compare.

Returns:

True if the pdarrays are the same, o.w. False.

Return type:

bool

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> a = ak.array([1, 2, 3])
>>> a_cpy = ak.array([1, 2, 3])
>>> a.equals(a_cpy)
True
>>> a2 = ak.array([1, 2, 5)
>>> a.equals(a2)
False
fill(value: arkouda.numpy.dtypes.numeric_scalars) None[source]

Fill the array (in place) with a constant value.

Parameters:

value (numeric_scalars)

Raises:

TypeError – Raised if value is not an int, int64, float, or float64

flatten()[source]

Return a copy of the array collapsed into one dimension.

Return type:

A copy of the input array, flattened to one dimension.

format_other(other) str[source]

Attempt to cast scalar other to the element dtype of this pdarray, and print the resulting value to a string (e.g. for sending to a server command). The user should not call this function directly.

Parameters:

other (object) – The scalar to be cast to the pdarray.dtype

Return type:

string representation of np.dtype corresponding to the other parameter

Raises:

TypeError – Raised if the other parameter cannot be converted to Numpy dtype

property inferred_type: str | None

Return a string of the type inferred from the values.

info() str[source]

Returns a JSON formatted string containing information about all components of self

Parameters:

None

Returns:

JSON string containing information about all components of self

Return type:

str

is_registered() numpy.bool_[source]

Return True iff the object is contained in the registry

Parameters:

None

Returns:

Indicates if the object is contained in the registry

Return type:

bool

Raises:

RuntimeError – Raised if there’s a server-side error thrown

Note

This will return True if the object is registered itself or as a component of another object

is_sorted(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.bool_scalars | pdarray[source]

Return True iff the array is monotonically non-decreasing.

Parameters:

None

Returns:

Indicates if the array is monotonically non-decreasing

Return type:

bool

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

itemsize
max(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the maximum value of the array.

property max_bits
maxk(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the maximum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

mean() numpy.float64[source]

Return the mean of the array.

min(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the minimum value of the array.

mink(k: arkouda.numpy.dtypes.int_scalars) pdarray[source]

Compute the minimum “k” values.

Parameters:

k (int_scalars) – The desired count of maximum values to be returned by the output.

Returns:

The maximum k values from pda

Return type:

pdarray, int

Raises:

TypeError – Raised if pda is not a pdarray

name
property nbytes

The size of the pdarray in bytes.

Returns:

The size of the pdarray in bytes.

Return type:

int

ndim
objType = 'pdarray'
opeq(other, op)[source]
parity() pdarray[source]

Find the parity (XOR of all bits) in each element. See ak.parity.

popcount() pdarray[source]

Find the population (number of bits set) in each element. See ak.popcount.

pretty_print_info() None[source]

Prints information about all components of self in a human readable format

Parameters:

None

Return type:

None

prod(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the product of all elements in the array. Return value is always a np.float64 or np.int64.

register(user_defined_name: str) pdarray[source]

Register this pdarray with a user defined name in the arkouda server so it can be attached to later using pdarray.attach() This is an in-place operation, registering a pdarray more than once will update the name in the registry and remove the previously registered name. A name can only be registered to one pdarray at a time.

Parameters:

user_defined_name (str) – user defined name array is to be registered under

Returns:

The same pdarray which is now registered with the arkouda server and has an updated name. This is an in-place modification, the original is returned to support a fluid programming style. Please note you cannot register two different pdarrays with the same name.

Return type:

pdarray

Raises:
  • TypeError – Raised if user_defined_name is not a str

  • RegistrationError – If the server was unable to register the pdarray with the user_defined_name If the user is attempting to register more than one pdarray with the same name, the former should be unregistered first to free up the registration name.

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
registered_name: str | None = None
reshape(*shape)[source]

Gives a new shape to an array without changing its data.

Parameters:

shape (int, tuple of ints, or pdarray) – The new shape should be compatible with the original shape.

Returns:

a pdarray with the same data, reshaped to the new shape

Return type:

pdarray

rotl(other) pdarray[source]

Rotate bits left by <other>.

rotr(other) pdarray[source]

Rotate bits right by <other>.

save(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None, file_format: str = 'HDF5', file_type: str = 'distribute') str[source]

DEPRECATED Save the pdarray to HDF5 or Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. HDF5 support single files, in which case the file name will only be that provided. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:
  • compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

  • file_format (str {'HDF5', 'Parquet'}) – By default, saved files will be written to the HDF5 file format. If ‘Parquet’, the files will be written to the Parquet file format. This is case insensitive.

  • file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:
  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

  • ValueError – Raised if there is an error in parsing the prefix path pointing to file write location or if the mode parameter is neither truncate nor append

  • TypeError – Raised if any one of the prefix_path, dataset, or mode parameters is not a string

Notes

The prefix_path must be visible to the arkouda server and the user must have write permission. Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales. If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. Previously all files saved in Parquet format were saved with a .parquet file extension. This will require you to use load as if you saved the file with the extension. Try this if an older file is not being found. Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.save('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.save('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving with an extension (Parquet)
>>> a.save('path/prefix.parquet', dataset='array', file_format='Parquet')
Saves the array in numLocales Parquet files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
property shape

Return the shape of an array.

Returns:

The elements of the shape tuple give the lengths of the corresponding array dimensions.

Return type:

tuple of int

size
slice_bits(low, high) pdarray[source]

Returns a pdarray containing only bits from low to high of self.

This is zero indexed and inclusive on both ends, so slicing the bottom 64 bits is pda.slice_bits(0, 63)

Parameters:
  • low (int) – The lowest bit included in the slice (inclusive) zero indexed, so the first bit is 0

  • high (int) – The highest bit included in the slice (inclusive)

Returns:

A new pdarray containing the bits of self from low to high

Return type:

pdarray

Raises:

RuntimeError – Raised if there is a server-side error thrown

Examples

>>> p = ak.array([2**65 + (2**64 - 1)])
>>> bin(p[0])
'0b101111111111111111111111111111111111111111111111111111111111111111'
>>> bin(p.slice_bits(64, 65)[0])
'0b10'
std(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the standard deviation. See arkouda.std for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

The scalar standard deviation of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • RuntimeError – Raised if there’s a server-side error thrown

sum(axis: int | Tuple[int, Ellipsis] | None = None, keepdims: bool = False) arkouda.numpy.dtypes.numpy_scalars | pdarray[source]

Return the sum of all elements in the array.

to_csv(prefix_path: str, dataset: str = 'array', col_delim: str = ',', overwrite: bool = False)[source]

Write pdarray to CSV file(s). File will contain a single column with the pdarray data. All CSV Files written by Arkouda include a header denoting data types of the columns.

prefix_path: str

The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

dataset: str

Column name to save the pdarray under. Defaults to “array”.

col_delim: str

Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

overwrite: bool

Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

str reponse message

ValueError

Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

RuntimeError

Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

TypeError

Raised if we receive an unknown arkouda_type returned from the server

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (`

`) at this time.

to_cuda()[source]

Convert the array to a Numba DeviceND array, transferring array data from the arkouda server to Python via ndarray. If the array exceeds a builtin size limit, a RuntimeError is raised.

Returns:

A Numba ndarray with the same attributes and data as the pdarray; on GPU

Return type:

numba.DeviceNDArray

Raises:
  • ImportError – Raised if CUDA is not available

  • ModuleNotFoundError – Raised if Numba is either not installed or not enabled

  • RuntimeError – Raised if there is a server-side error thrown in the course of retrieving the pdarray.

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_cuda()
array([0, 1, 2, 3, 4])
>>> type(a.to_cuda())
numpy.devicendarray
to_hdf(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', file_type: str = 'distribute') str[source]

Save the pdarray to HDF5. The object can be saved to a collection of files or single file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

file_type (str ("single" | "distribute")) – Default: “distribute” When set to single, dataset is written to a single file. When distribute, dataset is written on a file per locale. This is only supported by HDF5 files and will have no impact of Parquet Files.

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. Otherwise, the file name will be prefix_path. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_hdf('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_hdf('path/prefix.h5', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.h5`` where #### is replaced by each locale number
>>> # Saving to a single file
>>> a.to_hdf('path/prefix.hdf5', dataset='array', file_type='single')
Saves the array in to single hdf5 file on the root node.
``cwd/path/name_prefix.hdf5``
to_list() List[source]

Convert the array to a list, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A list with the same data as the pdarray

Return type:

list

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

to_ndarray

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_list()
[0, 1, 2, 3, 4]
>>> type(a.to_list())
list
to_ndarray() numpy.ndarray[source]

Convert the array to a np.ndarray, transferring array data from the Arkouda server to client-side Python. Note: if the pdarray size exceeds client.maxTransferBytes, a RuntimeError is raised.

Returns:

A numpy ndarray with the same attributes and data as the pdarray

Return type:

np.ndarray

Raises:

RuntimeError – Raised if there is a server-side error thrown, if the pdarray size exceeds the built-in client.maxTransferBytes size limit, or if the bytes received does not match expected number of bytes

Notes

The number of bytes in the array cannot exceed client.maxTransferBytes, otherwise a RuntimeError will be raised. This is to protect the user from overflowing the memory of the system on which the Python client is running, under the assumption that the server is running on a distributed system with much more memory than the client. The user may override this limit by setting client.maxTransferBytes to a larger value, but proceed with caution.

See also

array, to_list

Examples

>>> a = ak.arange(0, 5, 1)
>>> a.to_ndarray()
array([0, 1, 2, 3, 4])
>>> type(a.to_ndarray())
numpy.ndarray
to_parquet(prefix_path: str, dataset: str = 'array', mode: str = 'truncate', compression: str | None = None) str[source]

Save the pdarray to Parquet. The result is a collection of files, one file per locale of the arkouda server, where each filename starts with prefix_path. Each locale saves its chunk of the array to its corresponding file. :param prefix_path: Directory and filename prefix that all output files share :type prefix_path: str :param dataset: Name of the dataset to create in files (must not already exist) :type dataset: str :param mode: By default, truncate (overwrite) output files, if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:

compression (str (Optional)) – (None | “snappy” | “gzip” | “brotli” | “zstd” | “lz4”) Sets the compression type used with Parquet files

Return type:

string message indicating result of save operation

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • The prefix_path must be visible to the arkouda server and the user must

have write permission. - Output files have names of the form <prefix_path>_LOCALE<i>, where <i> ranges from 0 to numLocales for file_type=’distribute’. - ‘append’ write mode is supported, but is not efficient. - If any of the output files already exist and the mode is ‘truncate’, they will be overwritten. If the mode is ‘append’ and the number of output files is less than the number of locales or a dataset with the same name already exists, a RuntimeError will result. - Any file extension can be used.The file I/O does not rely on the extension to determine the file format.

Examples

>>> a = ak.arange(25)
>>> # Saving without an extension
>>> a.to_parquet('path/prefix', dataset='array')
Saves the array to numLocales HDF5 files with the name ``cwd/path/name_prefix_LOCALE####``
>>> # Saving with an extension (HDF5)
>>> a.to_parqet('path/prefix.parquet', dataset='array')
Saves the array to numLocales HDF5 files with the name
``cwd/path/name_prefix_LOCALE####.parquet`` where #### is replaced by each locale number
transfer(hostname: str, port: arkouda.numpy.dtypes.int_scalars)[source]

Sends a pdarray to a different Arkouda server

Parameters:
  • hostname (str) – The hostname where the Arkouda server intended to receive the pdarray is running.

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to ak.receive_array().

Return type:

A message indicating a complete transfer

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

unregister() None[source]

Unregister a pdarray in the arkouda server which was previously registered using register() and/or attahced to using attach()

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.pdarray.attach("my_zeros")
>>> # ...other work...
>>> b.unregister()
update_hdf(prefix_path: str, dataset: str = 'array', repack: bool = True)[source]

Overwrite the dataset with the name provided with this pdarray. If the dataset does not exist it is added

Parameters:
  • prefix_path (str) – Directory and filename prefix that all output files share

  • dataset (str) – Name of the dataset to create in files

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Return type:

str - success message if successful

Raises:

RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the dataset provided does not exist, it will be added

value_counts()[source]

Count the occurrences of the unique values of self.

Returns:

  • unique_values (pdarray) – The unique values, sorted in ascending order

  • counts (pdarray, int64) – The number of times the corresponding unique value occurs

Examples

>>> ak.array([2, 0, 2, 4, 0, 0]).value_counts()
(array([0, 2, 4]), array([3, 2, 1]))
var(ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Compute the variance. See arkouda.var for details.

Parameters:

ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

The scalar variance of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • ValueError – Raised if the ddof >= pdarray size

  • RuntimeError – Raised if there’s a server-side error thrown

arkouda.pi: float
arkouda.plot_dist(b, h, log=True, xlabel=None, newfig=True)[source]

Plot the distribution and cumulative distribution of histogram Data

Parameters:
  • b (np.ndarray) – Bin edges

  • h (np.ndarray) – Histogram data

  • log (bool) – use log to scale y

  • xlabel (str) – Label for the x axis of the graph

  • newfig (bool) – Generate a new figure or not

Notes

This function does not return or display the plot. A user must have matplotlib imported in addition to arkouda to display plots. This could be updated to return the object or have a flag to show the resulting plots. See Examples Below.

Examples

>>> import arkouda as ak
>>> from matplotlib import pyplot as plt
>>> b, h = ak.histogram(ak.arange(10), 3)
>>> ak.plot_dist(b, h.to_ndarray())
>>> # to show the plot
>>> plt.show()
arkouda.popcount(pda: pdarray) pdarray[source]

Find the population (number of bits set) for each integer in an array.

Parameters:

pda (pdarray, int64, uint64, bigint) – Input array (must be integral).

Returns:

population – The number of bits set (1) in each element

Return type:

pdarray

Raises:

TypeError – If input array is not int64, uint64, or bigint

Examples

>>> A = ak.arange(10)
>>> ak.popcount(A)
array([0, 1, 1, 2, 1, 2, 2, 3, 1, 2])
arkouda.power(pda: pdarray, pwr: int | float | pdarray, where: arkouda.numpy.dtypes.bool_scalars | pdarray = True) pdarray[source]

Raises an array to a power. If where is given, the operation will only take place in the positions where the where condition is True.

Note: Our implementation of the where argument deviates from numpy. The difference in behavior occurs at positions where the where argument contains a False. In numpy, these position will have uninitialized memory (which can contain anything and will vary between runs). We have chosen to instead return the value of the original array in these positions.

Parameters:
  • pda (pdarray) – A pdarray of values that will be raised to a power (pwr)

  • pwr (integer, float, or pdarray) – The power(s) that pda is raised to

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the corresponding value will be raised to the respective power. Elsewhere, it will retain its original value. Default set to True.

Returns:

pdarray Returns a pdarray of values raised to a power, under the boolean where condition.

Examples

>>> a = ak.arange(5)
>>> ak.power(a, 3)
array([0, 1, 8, 27, 64])
>>> ak.power(a), 3, a % 2 == 0)
array([0, 1, 8, 3, 64])
arkouda.power_divergence(f_obs, f_exp=None, ddof=0, lambda_=None)[source]

Computes the power divergence statistic and p-value.

Parameters:
  • f_obs (pdarray) – The observed frequency.

  • f_exp (pdarray, default = None) – The expected frequency.

  • ddof (int) – The delta degrees of freedom.

  • lambda (string, default = "pearson") –

    The power in the Cressie-Read power divergence statistic. Allowed values: “pearson”, “log-likelihood”, “freeman-tukey”, “mod-log-likelihood”, “neyman”, “cressie-read”

    Powers correspond as follows:

    ”pearson”: 1

    ”log-likelihood”: 0

    ”freeman-tukey”: -0.5

    ”mod-log-likelihood”: -1

    ”neyman”: -2

    ”cressie-read”: 2 / 3

Return type:

arkouda.akstats.Power_divergenceResult

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.stats import power_divergence
>>> x = ak.array([10, 20, 30, 10])
>>> y = ak.array([10, 30, 20, 10])
>>> power_divergence(x, y, lambda_="pearson")
Power_divergenceResult(statistic=8.333333333333334, pvalue=0.03960235520756414)
>>> power_divergence(x, y, lambda_="log-likelihood")
Power_divergenceResult(statistic=8.109302162163285, pvalue=0.04380595350226197)

See also

scipy.stats.power_divergence, arkouda.akstats.chisquare

Notes

This is a modified version of scipy.stats.power_divergence [2] in order to scale using arkouda pdarrays.

References

[1] “scipy.stats.power_divergence”, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.power_divergence.html

[2] Scipy contributors (2024) scipy (Version v1.12.0) [Source code]. https://github.com/scipy/scipy

arkouda.pretty_print_information(names: List[str] | str = RegisteredSymbols) None[source]

Prints verbose information for each object in names in a human readable format

Parameters:

names (Union[List[str], str]) – names is either the name of an object or list of names of objects to retrieve info if names is ak.AllSymbols, retrieves info for all symbols in the symbol table if names is ak.RegisteredSymbols, retrieves info for all symbols in the registry

Return type:

None

Raises:

RuntimeError – Raised if a server-side error is thrown in the process of retrieving information about the objects in names

arkouda.putmask(A: arkouda.pdarrayclass.pdarray, mask: arkouda.pdarrayclass.pdarray, Values: arkouda.pdarrayclass.pdarray)[source]

Overwrites elements of A with elements from B based upon a mask array. Similar to numpy.putmask, where mask = False, A retains its original value, but where mask = True, A is overwritten with the corresponding entry from Values.

This is similar to ak.where, except that (1) no new pdarray is created, and (2) Values does not have to be the same size as A and mask.

Parameters:
  • A (pdarray) – Value(s) used when mask is False (see Notes for allowed dtypes)

  • mask (pdarray) – Used to choose values from A or B, must be same size as A, and of type ak.bool_

  • Values (pdarray) – Value(s) used when mask is False (see Notes for allowed dtypes)

Examples

>>> a = ak.array(np.arange(10))
>>> ak.putmask (a,a>2,a**2)
>>> a
array ([0,1,2,9,16,25,36,49,64,81])
>>> a = ak.array(np.arange(10))
>>> values = ak.array([3,2])
>>> ak.putmask (a,a>2,values)
>>> a
array ([0,1,2,2,3,2,3,2,3,2])
Raises:

RuntimeError – Raised if mask is not same size as A, or if A.dtype and Values.dtype are not an allowed pair (see Notes for details).

Notes

A and mask must be the same size. Values can be any size.

Allowed dtypes for A and Values conform to types accepted by numpy putmask.

If A is ak.float64, Values can be ak.float64, ak.int64, ak.uint64, ak.bool_.

If A is ak.int64, Values can be ak.int64 or ak.bool_.

If A is ak.uint64, Values can be ak.int64, ak.uint64, or ak.bool_.

If A is ak.bool_, Values must be ak.bool_.

Only one conditional clause is supported e.g., n < 5, n > 1, which is supported in numpy is not currently supported in Arkouda

Only 1D pdarrays are implemented for now.

arkouda.rad2deg(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Converts angles element-wise from radians to degrees.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the corresponding value will be converted from radians to degrees. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing an angle converted to degrees, from radians, for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.random_sparse_matrix(size: int, density: float, layout: str, dtype: type | str = int64) arkouda.sparrayclass.sparray[source]

Create a random sparse matrix with the specified number of rows and columns and the specified density. The density is the fraction of non-zero elements in the matrix. The non-zero elements are uniformly distributed random numbers in the range [0,1).

Parameters:
  • size (int) – The number of rows in the matrix, columns are equal to rows right now

  • density (float) – The fraction of non-zero elements in the matrix

  • dtype (Union[DTypes, str]) – The dtype of the elements in the matrix (default is int64)

Returns:

A sparse matrix with the specified number of rows and columns and the specified density

Return type:

sparray

Raises:

ValueError – Raised if density is not in the range [0,1]

arkouda.read(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets=False, column_delim: str = ',', read_nested: bool = True, has_non_float_nulls: bool = False, fixed_len: int = -1) Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index][source]

Read datasets from files. File Type is determined automatically.

Parameters:
  • filenames (list or str) – Either a list of filenames or shell expression

  • datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)

  • iterative (bool) – Iterative (True) or Single (False) function call(s) to server

  • strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.

  • column_delim (str) – Column delimiter to be used if dataset is CSV. Otherwise, unused.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Ignored if datasets is not None Parquet Files only.

  • has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.

  • fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the length of each string is known at runtime. This can allow for skipping byte calculation, which can have an impact on performance.

Returns:

Dictionary of {datasetName: pdarray, String, or SegArray}

Return type:

Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.

Raises:

RuntimeError – If invalid filetype is detected

Notes

If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.

If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.

If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets to HDF5/Parquet files.

CSV files without the Arkouda Header are not supported.

Examples

Read with file Extension >>> x = ak.read(‘path/name_prefix.h5’) # load HDF5 - processing determines file type not extension Read without file Extension >>> x = ak.read(‘path/name_prefix.parquet’) # load Parquet Read Glob Expression >>> x = ak.read(‘path/name_prefix*’) # Reads HDF5

arkouda.read_csv(filenames: str | List[str], datasets: str | List[str] | None = None, column_delim: str = ',', allow_errors: bool = False) Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index][source]

Read CSV file(s) into Arkouda objects. If more than one dataset is found, the objects will be returned in a dictionary mapping the dataset name to the Arkouda object containing the data. If the file contains the appropriately formatted header, typed data will be returned. Otherwise, all data will be returned as a Strings object.

Parameters:
  • filenames (str or List[str]) – The filenames to read data from

  • datasets (str or List[str] (Optional)) – names of the datasets to read. When None, all datasets will be read.

  • column_delim (str) – The delimiter for column names and data. Defaults to “,”.

  • allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

Returns:

Dictionary of {datasetName: pdarray, String, or SegArray}

Return type:

Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

See also

to_csv

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

arkouda.read_hdf(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, calc_string_offsets: bool = False, tag_data=False) Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index][source]

Read Arkouda objects from HDF5 file/s

Parameters:
  • filenames (str, List[str]) – Filename/s to read objects from

  • datasets (Optional str, List[str]) – datasets to read from the provided files

  • iterative (bool) – Iterative (True) or Single (False) function call(s) to server

  • strict_types (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.

  • tagData (bool) – Default False, if True tag the data with the code associated with the filename that the data was pulled from.

Returns:

Dictionary of {datasetName: pdarray, String, SegArray}

Return type:

Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.

Raises:
  • ValueError – Raised if all datasets are not present in all hdf5 files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.

If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.

If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets to HDF5 files.

See also

read_tagged_data

Examples

>>>
# Read with file Extension
>>> x = ak.read_hdf('path/name_prefix.h5') # load HDF5
# Read Glob Expression
>>> x = ak.read_hdf('path/name_prefix*') # Reads HDF5
arkouda.read_parquet(filenames: str | List[str], datasets: str | List[str] | None = None, iterative: bool = False, strict_types: bool = True, allow_errors: bool = False, tag_data: bool = False, read_nested: bool = True, has_non_float_nulls: bool = False, fixed_len: int = -1) Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray | arkouda.categorical.Categorical | arkouda.dataframe.DataFrame | arkouda.client_dtypes.IPv4 | arkouda.timeclass.Datetime | arkouda.timeclass.Timedelta | arkouda.index.Index][source]

Read Arkouda objects from Parquet file/s

Parameters:
  • filenames (str, List[str]) – Filename/s to read objects from

  • datasets (Optional str, List[str]) – datasets to read from the provided files

  • iterative (bool) – Iterative (True) or Single (False) function call(s) to server

  • strict_types (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • tagData (bool) – Default False, if True tag the data with the code associated with the filename that the data was pulled from.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. If datasets is not None, this will be ignored.

  • has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.

  • fixed_len (int) – Default -1. This value can be set for reading Parquet string columns when the length of each string is known at runtime. This can allow for skipping byte calculation, which can have an impact on performance.

Returns:

Dictionary of {datasetName: pdarray, String, or SegArray}

Return type:

Returns a dictionary of Arkouda pdarrays, Arkouda Strings, or Arkouda Segarrays.

Raises:
  • ValueError – Raised if all datasets are not present in all parquet files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

Notes

If filenames is a string, it is interpreted as a shell expression (a single filename is a valid expression, so it will work) and is expanded with glob to read all matching files.

If iterative == True each dataset name and file names are passed to the server as independent sequential strings while if iterative == False all dataset names and file names are passed to the server in a single string.

If datasets is None, infer the names of datasets from the first file and read all of them. Use get_datasets to show the names of datasets to Parquet files.

Parquet always recomputes offsets at this time This will need to be updated once parquets workflow is updated

See also

read_tagged_data

Examples

Read without file Extension >>> x = ak.read_parquet(‘path/name_prefix.parquet’) # load Parquet Read Glob Expression >>> x = ak.read_parquet(‘path/name_prefix*’) # Reads Parquet

arkouda.read_tagged_data(filenames: str | List[str], datasets: str | List[str] | None = None, strictTypes: bool = True, allow_errors: bool = False, calc_string_offsets=False, read_nested: bool = True, has_non_float_nulls: bool = False)[source]

Read datasets from files and tag each record to the file it was read from. File Type is determined automatically.

Parameters:
  • filenames (list or str) – Either a list of filenames or shell expression

  • datasets (list or str or None) – (List of) name(s) of dataset(s) to read (default: all available)

  • strictTypes (bool) – If True (default), require all dtypes of a given dataset to have the same precision and sign. If False, allow dtypes of different precision and sign across different files. For example, if one file contains a uint32 dataset and another contains an int64 dataset with the same name, the contents of both will be read into an int64 pdarray.

  • allow_errors (bool) – Default False, if True will allow files with read errors to be skipped instead of failing. A warning will be included in the return containing the total number of files skipped due to failure and up to 10 filenames.

  • calc_string_offsets (bool) – Default False, if True this will tell the server to calculate the offsets/segments array on the server versus loading them from HDF5 files. In the future this option may be set to True as the default.

  • read_nested (bool) – Default True, when True, SegArray objects will be read from the file. When False, SegArray (or other nested Parquet columns) will be ignored. Ignored if datasets is not None Parquet Files only.

  • has_non_float_nulls (bool) – Default False. This flag must be set to True to read non-float parquet columns that contain null values.

Notes

Not currently supported for Categorical or GroupBy datasets

Examples

Read files and return data with tagging corresponding to the Categorical returned cat.codes will link the codes in data to the filename. Data will contain the code Filename_Codes >>> data, cat = ak.read_tagged_data(‘path/name’) >>> data {‘Filname_Codes’: array([0 3 6 9 12]), ‘col_name’: array([0 0 0 1])}

arkouda.read_zarr(store_path: str, ndim: int, dtype)[source]

Reads a Zarr store from disk into a pdarray. Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.

Parameters:
  • store_path (str) – The path to the Zarr store. The path must be to a directory that contains a .zarray file containing the Zarr store metadata.

  • ndim (int) – The number of dimensions in the array

  • dtype (str) – The data type of the array

Returns:

The pdarray read from the Zarr store.

Return type:

pdarray

arkouda.receive(hostname: str, port)[source]

Receive a pdarray sent by pdarray.transfer().

Parameters:
  • hostname (str) – The hostname of the pdarray that sent the array

  • port (int_scalars) – The port to send the array over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to pdarray.transfer().

Returns:

The pdarray sent from the sending server to the current receiving server.

Return type:

pdarray

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

arkouda.receive_dataframe(hostname: str, port)[source]

Receive a pdarray sent by dataframe.transfer().

Parameters:
  • hostname (str) – The hostname of the dataframe that sent the array

  • port (int_scalars) – The port to send the dataframe over. This needs to be an open port (i.e., not one that the Arkouda server is running on). This will open up numLocales ports, each of which in succession, so will use ports of the range {port..(port+numLocales)} (e.g., running an Arkouda server of 4 nodes, port 1234 is passed as port, Arkouda will use ports 1234, 1235, 1236, and 1237 to send the array data). This port much match the port passed to the call to pdarray.send_array().

Returns:

The dataframe sent from the sending server to the current receiving server.

Return type:

pdarray

Raises:
  • ValueError – Raised if the op is not within the pdarray.BinOps set

  • TypeError – Raised if other is not a pdarray or the pdarray.dtype is not a supported dtype

arkouda.register_all(data: dict)[source]

Register all objects in the provided dictionary

Parameters:

data (dict) – Maps name to register the object to the object. For example, {“MyArray”: ak.array([0, 1, 2])

Return type:

None

arkouda.resolve_scalar_dtype(val: object) str[source]

Try to infer what dtype arkouda_server should treat val as.

arkouda.restore(filename)[source]

Return data saved using ak.snapshot

Parameters:
  • filename (str)

  • read (Name used to create snapshot to be)

Return type:

Dict

Notes

Unlike other save/load methods using snapshot restore will save DataFrames alongside other objects in HDF5. Thus, they are returned within the dictionary as a dataframe.

arkouda.right_align(left, right)[source]

Map two arrays of sparse values to the 0-up index set implied by the right array, discarding values from left that do not appear in right.

Parameters:
  • left (pdarray or a sequence of pdarrays) – Left-hand identifiers

  • right (pdarray or a sequence of pdarrays) – Right-hand identifiers that define the index

Returns:

  • keep (pdarray, bool) – Logical index of left-hand values that survived

  • aligned ((pdarray, pdarray)) – Left and right arrays with values replaced by 0-up indices

arkouda.rotl(x, rot) pdarray[source]

Rotate bits of <x> to the left by <rot>.

Parameters:
Returns:

rotated – The rotated elements of x.

Return type:

pdarray(int64/uint64)

Raises:

TypeError – If input array is not int64 or uint64

Examples

>>> A = ak.arange(10)
>>> ak.rotl(A, A)
array([0, 2, 8, 24, 64, 160, 384, 896, 2048, 4608])
arkouda.rotr(x, rot) pdarray[source]

Rotate bits of <x> to the left by <rot>.

Parameters:
Returns:

rotated – The rotated elements of x.

Return type:

pdarray(int64/uint64)

Raises:

TypeError – If input array is not int64 or uint64

Examples

>>> A = ak.arange(10)
>>> ak.rotr(1024 * A, A)
array([0, 512, 512, 384, 256, 160, 96, 56, 32, 18])
arkouda.round(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise rounding of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing input array elements rounded to the nearest integer

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.round(ak.array([1.1, 2.5, 3.14159]))
array([1, 3, 3])
arkouda.save_all(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray], prefix_path: str, names: List[str] | None = None, file_format='HDF5', mode: str = 'truncate', file_type: str = 'distribute', compression: str | None = None) None[source]

DEPRECATED Save multiple named pdarrays to HDF5/Parquet files. :param columns: Collection of arrays to save :type columns: dict or list of pdarrays :param prefix_path: Directory and filename prefix for output files :type prefix_path: str :param names: Dataset names for the pdarrays :type names: list of str :param file_format: ‘HDF5’ or ‘Parquet’. Defaults to hdf5 :type file_format: str :param mode: By default, truncate (overwrite) the output files if they exist.

If ‘append’, attempt to create new dataset in existing files.

Parameters:
  • file_type (str ("single" | "distribute")) – Default: distribute Single writes the dataset to a single file Distribute writes the dataset to a file per locale Only used with HDF5

  • compression (str (None | "snappy" | "gzip" | "brotli" | "zstd" | "lz4")) – Optional Select the compression to use with Parquet files. Only used with Parquet.

Return type:

None

Raises:

ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode is not ‘truncate’ or ‘append’

See also

save, load_all, to_parquet, to_hdf

Notes

Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at path_prefix will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.

Examples

>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.save_all({'a': a, 'b': b}, 'path/name_prefix', file_format='Parquet')
>>> # Save using names instead of mapping
>>> ak.save_all([a, b], 'path/name_prefix', names=['a', 'b'], file_format='Parquet')
class arkouda.sctypeDict

dict() -> new empty dictionary dict(mapping) -> new dictionary initialized from a mapping object’s

(key, value) pairs

dict(iterable) -> new dictionary initialized as if via:

d = {} for k, v in iterable:

d[k] = v

dict(**kwargs) -> new dictionary initialized with the name=value pairs

in the keyword argument list. For example: dict(one=1, two=2)

clear(*args, **kwargs)

D.clear() -> None. Remove all items from D.

copy(*args, **kwargs)

D.copy() -> a shallow copy of D

fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items(*args, **kwargs)

D.items() -> a set-like object providing a view on D’s items

keys(*args, **kwargs)

D.keys() -> a set-like object providing a view on D’s keys

pop(*args, **kwargs)

D.pop(k[,d]) -> v, remove specified key and return the corresponding value.

If key is not found, default is returned if given, otherwise KeyError is raised

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update(*args, **kwargs)

D.update([E, ]**F) -> None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values(*args, **kwargs)

D.values() -> an object providing a view on D’s values

class arkouda.sctypes

dict() -> new empty dictionary dict(mapping) -> new dictionary initialized from a mapping object’s

(key, value) pairs

dict(iterable) -> new dictionary initialized as if via:

d = {} for k, v in iterable:

d[k] = v

dict(**kwargs) -> new dictionary initialized with the name=value pairs

in the keyword argument list. For example: dict(one=1, two=2)

clear(*args, **kwargs)

D.clear() -> None. Remove all items from D.

copy(*args, **kwargs)

D.copy() -> a shallow copy of D

fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items(*args, **kwargs)

D.items() -> a set-like object providing a view on D’s items

keys(*args, **kwargs)

D.keys() -> a set-like object providing a view on D’s keys

pop(*args, **kwargs)

D.pop(k[,d]) -> v, remove specified key and return the corresponding value.

If key is not found, default is returned if given, otherwise KeyError is raised

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update(*args, **kwargs)

D.update([E, ]**F) -> None. Update D from dict/iterable E and F. If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values(*args, **kwargs)

D.values() -> an object providing a view on D’s values

arkouda.search_intervals(vals, intervals, tiebreak=None, hierarchical=True)[source]

Given an array of query vals and non-overlapping, closed intervals, return the index of the best (see tiebreak) interval containing each query value, or -1 if not present in any interval.

Parameters:
  • vals ((sequence of) pdarray(int, uint, float)) – Values to search for in intervals. If multiple arrays, each “row” is an item.

  • intervals (2-tuple of (sequences of) pdarrays) – Non-overlapping, half-open intervals, as a tuple of (lower_bounds_inclusive, upper_bounds_exclusive) Must have same dtype(s) as vals.

  • tiebreak ((optional) pdarray, numeric) – When a value is present in more than one interval, the interval with the lowest tiebreak value will be chosen. If no tiebreak is given, the first containing interval will be chosen.

  • hierarchical (boolean) – When True, sequences of pdarrays will be treated as components specifying a single dimension (i.e. hierarchical) When False, sequences of pdarrays will be specifying multi-dimensional intervals

Returns:

idx – Index of interval containing each query value, or -1 if not found

Return type:

pdarray(int64)

Notes

The return idx satisfies the following condition:

present = idx > -1 ((intervals[0][idx[present]] <= vals[present]) &

(intervals[1][idx[present]] >= vals[present])).all()

Examples

>>> starts = (ak.array([0, 5]), ak.array([0, 11]))
>>> ends = (ak.array([5, 9]), ak.array([10, 20]))
>>> vals = (ak.array([0, 0, 2, 5, 5, 6, 6, 9]), ak.array([0, 20, 1, 5, 15, 0, 12, 30]))
>>> ak.search_intervals(vals, (starts, ends), hierarchical=False)
array([0 -1 0 0 1 -1 1 -1])
>>> ak.search_intervals(vals, (starts, ends))
array([0 0 0 0 1 1 1 -1])
>>> bi_starts = ak.bigint_from_uint_arrays([ak.cast(a, ak.uint64) for a in starts])
>>> bi_ends = ak.bigint_from_uint_arrays([ak.cast(a, ak.uint64) for a in ends])
>>> bi_vals = ak.bigint_from_uint_arrays([ak.cast(a, ak.uint64) for a in vals])
>>> bi_starts, bi_ends, bi_vals
(array(["0" "92233720368547758091"]),
array(["92233720368547758090" "166020696663385964564"]),
array(["0" "20" "36893488147419103233" "92233720368547758085" "92233720368547758095"
"110680464442257309696" "110680464442257309708" "166020696663385964574"]))
>>> ak.search_intervals(bi_vals, (bi_starts, bi_ends))
array([0 0 0 0 1 1 1 -1])
arkouda.segarray(segments: arkouda.pdarrayclass.pdarray, values: arkouda.pdarrayclass.pdarray, lengths=None, grouping=None)[source]

Alias for the from_parts function. Prevents user from needing to call ak.SegArray constructor DEPRECATED

arkouda.setdiff1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable[source]

Find the set difference of two arrays.

Return the sorted, unique values in pda1 that are not in pda2.

Parameters:
  • pda1 (pdarray/Sequence[pdarray, Strings, Categorical]) – Input array/Sequence of groupable objects

  • pda2 (pdarray/List) – Input array/sequence of groupable objects

  • assume_unique (bool) – If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.

Returns:

Sorted 1D array/List of sorted pdarrays of values in pda1 that are not in pda2.

Return type:

pdarray/groupable

Raises:
  • TypeError – Raised if either pda1 or pda2 is not a pdarray

  • RuntimeError – Raised if the dtype of either pdarray is not supported

Notes

ak.setdiff1d is not supported for bool or float64 pdarrays

Examples

>>> a = ak.array([1, 2, 3, 2, 4, 1])
>>> b = ak.array([3, 4, 5, 6])
>>> ak.setdiff1d(a, b)
array([1, 2])
#Multi-Array Example
>>> a = ak.arange(1, 6)
>>> b = ak.array([1, 5, 3, 4, 2])
>>> c = ak.array([1, 4, 3, 2, 5])
>>> d = ak.array([1, 2, 3, 5, 4])
>>> multia = [a, a, a]
>>> multib = [b, c, d]
>>> ak.setdiff1d(multia, multib)
[array([2, 4, 5]), array([2, 4, 5]), array([2, 4, 5])]
arkouda.setxor1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable, assume_unique: bool = False) arkouda.pdarrayclass.pdarray | arkouda.groupbyclass.groupable[source]

Find the set exclusive-or (symmetric difference) of two arrays.

Return the sorted, unique values that are in only one (not both) of the input arrays.

Parameters:
  • pda1 (pdarray/Sequence[pdarray, Strings, Categorical]) – Input array/Sequence of groupable objects

  • pda2 (pdarray/List) – Input array/sequence of groupable objects

  • assume_unique (bool) – If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.

Returns:

Sorted 1D array/List of sorted pdarrays of unique values that are in only one of the input arrays.

Return type:

pdarray/groupable

Raises:
  • TypeError – Raised if either pda1 or pda2 is not a pdarray

  • RuntimeError – Raised if the dtype of either pdarray is not supported

Notes

ak.setxor1d is not supported for bool or float64 pdarrays

Examples

>>> a = ak.array([1, 2, 3, 2, 4])
>>> b = ak.array([2, 3, 5, 7, 5])
>>> ak.setxor1d(a,b)
array([1, 4, 5, 7])
#Multi-Array Example
>>> a = ak.arange(1, 6)
>>> b = ak.array([1, 5, 3, 4, 2])
>>> c = ak.array([1, 4, 3, 2, 5])
>>> d = ak.array([1, 2, 3, 5, 4])
>>> multia = [a, a, a]
>>> multib = [b, c, d]
>>> ak.setxor1d(multia, multib)
[array([2, 2, 4, 4, 5, 5]), array([2, 5, 2, 4, 4, 5]), array([2, 4, 5, 4, 2, 5])]
arkouda.shape(a: arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | bool | numpy.bool_ | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | numpy.str_ | str) Tuple[source]

Return the shape of an array.

Parameters:

a (pdarray) – Input array.

Returns:

shape – The elements of the shape tuple give the lengths of the corresponding array dimensions.

Return type:

tuple of ints

Examples

>>> import arkouda as ak
>>> ak.shape(ak.eye(3,2))
(3, 2)
>>> ak.shape([[1, 3]])
(1, 2)
>>> ak.shape([0])
(1,)
>>> ak.shape(0)
()
class arkouda.short(value)

Bases: numpy.signedinteger

Signed integer type, compatible with C short.

Character code:

'h'

Canonical name:

numpy.short

Alias on this platform (Linux x86_64):

numpy.int16: 16-bit signed integer (-32_768 to 32_767).

bit_count(*args, **kwargs)

int16.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.int16(127).bit_count()
7
>>> np.int16(-127).bit_count()
7
arkouda.sign(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise sign of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing sign values of the input array elements

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.sign(ak.array([-10, -5, 0, 5, 10]))
array([-1, -1, 0, 1, 1])
class arkouda.signedinteger(value)

Bases: numpy.integer

Abstract base class of all signed integer scalar types.

arkouda.sin(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise sine of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the sine will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing sin for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

class arkouda.single(value)

Bases: numpy.floating

Single-precision floating-point number type, compatible with C float.

Character code:

'f'

Canonical name:

numpy.single

Alias on this platform (Linux x86_64):

numpy.float32: 32-bit-precision floating-point number type: sign bit, 8 bits exponent, 23 bits mantissa.

as_integer_ratio(*args, **kwargs)

single.as_integer_ratio() -> (int, int)

Return a pair of integers, whose ratio is exactly equal to the original floating point number, and with a positive denominator. Raise OverflowError on infinities and a ValueError on NaNs.

>>> np.single(10.0).as_integer_ratio()
(10, 1)
>>> np.single(0.0).as_integer_ratio()
(0, 1)
>>> np.single(-.25).as_integer_ratio()
(-1, 4)
is_integer(*args, **kwargs)

single.is_integer() -> bool

Return True if the floating point number is finite with integral value, and False otherwise.

Added in version 1.22.

>>> np.single(-2.0).is_integer()
True
>>> np.single(3.2).is_integer()
False
arkouda.sinh(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise hyperbolic sine of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the hyperbolic sine will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing hyperbolic sine for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.skew(pda: pdarray, bias: bool = True) numpy.float64[source]

Computes the sample skewness of an array. Skewness > 0 means there’s greater weight in the right tail of the distribution. Skewness < 0 means there’s greater weight in the left tail of the distribution. Skewness == 0 means the data is normally distributed. Based on the scipy.stats.skew function.

Parameters:
  • pda (pdarray) – A pdarray of values that will be calculated to find the skew

  • bias (bool, optional) – If False, then the calculations are corrected for statistical bias.

Returns:

  • np.float64

    The skew of all elements in the array

  • Examples

  • >>> a = ak.array([1, 1, 1, 5, 10])

  • >>> ak.skew(a)

  • 0.9442193396379163

arkouda.snapshot(filename)[source]

Create a snapshot of the current Arkouda namespace. All currently accessible variables containing Arkouda objects will be written to an HDF5 file.

Unlike other save/load functions, this maintains the integrity of dataframes.

Current Variable names are used as the dataset name when saving.

Parameters:
  • filename (str)

  • file (Name to use when storing)

Return type:

None

See also

ak.restore

arkouda.sort(pda: arkouda.pdarrayclass.pdarray, algorithm: SortingAlgorithm = SortingAlgorithm.RadixSortLSD, axis=-1) arkouda.pdarrayclass.pdarray[source]

Return a sorted copy of the array. Only sorts numeric arrays; for Strings, use argsort.

Parameters:

pda (pdarray or Categorical) – The array to sort (int64, uint64, or float64)

Returns:

The sorted copy of pda

Return type:

pdarray, int64, uint64, or float64

Raises:
  • TypeError – Raised if the parameter is not a pdarray

  • ValueError – Raised if sort attempted on a pdarray with an unsupported dtype such as bool

See also

argsort

Notes

Uses a least-significant-digit radix sort, which is stable and resilient to non-uniformity in data but communication intensive.

Examples

>>> a = ak.randint(0, 10, 10)
>>> sorted = ak.sort(a)
>>> a
array([0, 1, 1, 3, 4, 5, 7, 8, 8, 9])
class arkouda.sparray(name: str, mydtype: numpy.dtype | str, size: arkouda.numpy.dtypes.int_scalars, nnz: arkouda.numpy.dtypes.int_scalars, ndim: arkouda.numpy.dtypes.int_scalars, shape: Sequence[int], layout: str, itemsize: arkouda.numpy.dtypes.int_scalars, max_bits: int | None = None)[source]

The class for sparse arrays. This class contains only the attributies of the array; the data resides on the arkouda server. When a server operation results in a new array, arkouda will create a sparray instance that points to the array data on the server. As such, the user should not initialize sparray instances directly.

name

The server-side identifier for the array

Type:

str

dtype

The element type of the array

Type:

dtype

size

The size of any one dimension of the array (all dimensions are assumed to be equal sized for now)

Type:

int_scalars

nnz

The number of non-zero elements in the array

Type:

int_scalars

ndim

The rank of the array (currently only rank 2 arrays supported)

Type:

int_scalars

shape

A list or tuple containing the sizes of each dimension of the array

Type:

Sequence[int]

layout

The layout of the array (“CSR” or “CSC” are the only valid values)

Type:

str

itemsize

The size in bytes of each element

Type:

int_scalars

dtype
fill_vals(a: arkouda.pdarrayclass.pdarray)[source]
itemsize
layout
name
ndim
nnz
shape
size
to_pdarray() List[arkouda.pdarrayclass.pdarray][source]
arkouda.sparse_matrix_matrix_mult(A, B: arkouda.sparrayclass.sparray) arkouda.sparrayclass.sparray[source]

Multiply two sparse matrices.

Parameters:
  • A (sparray) – The left-hand sparse matrix

  • B (sparray) – The right-hand sparse matrix

Returns:

The product of the two sparse matrices

Return type:

sparray

arkouda.sqrt(pda: pdarray, where: arkouda.numpy.dtypes.bool_scalars | pdarray = True) pdarray[source]

Takes the square root of array. If where is given, the operation will only take place in the positions where the where condition is True.

Parameters:
  • pda (pdarray) – A pdarray of values that will be square rooted

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the corresponding value will be square rooted. Elsewhere, it will retain its original value. Default set to True.

Returns:

  • pdarray Returns a pdarray of square rooted values, under the boolean where condition.

  • Examples

  • >>> a = ak.arange(5)

  • >>> ak.sqrt(a)

  • array([0 1 1.4142135623730951 1.7320508075688772 2])

  • >>> ak.sqrt(a, ak.sqrt([True, True, False, False, True]))

  • array([0, 1, 2, 3, 2])

arkouda.square(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise square of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing square values of the input array elements

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.square(ak.arange(1,5))
array([1, 4, 9, 16])
arkouda.squeeze(x: arkouda.pdarrayclass.pdarray | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | bool | numpy.bool_, /, axis: NoneType | int | Tuple[int, Ellipsis] = None) arkouda.pdarrayclass.pdarray[source]

Remove degenerate (size one) dimensions from an array.

Parameters:
  • x (pdarray) – The array to squeeze

  • axis (int or Tuple[int, ...]) – The axis or axes to squeeze (must have a size of one). If axis = None, all dimensions of size 1 will be squeezed.

Returns:

A copy of x with the dimensions specified in the axis argument removed.

Return type:

pdarray

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> x = ak.arange(10).reshape((1, 10, 1))
>>> x
array([array([array([0]) array([1]) array([2]) array([3])....
 array([4]) array([5]) array([6]) array([7]) array([8]) array([9])])])
>>> x.shape
(1, 10, 1)
>>> ak.squeeze(x,axis=None)
array([0 1 2 3 4 5 6 7 8 9])
>>> ak.squeeze(x,axis=None).shape
(10,)
>>> ak.squeeze(x,axis=2)
array([array([0 1 2 3 4 5 6 7 8 9])])
>>> ak.squeeze(x,axis=2).shape
(1, 10)
>>> ak.squeeze(x,axis=(0,2))
array([0 1 2 3 4 5 6 7 8 9])
>>> ak.squeeze(x,axis=(0,2)).shape
(10,)
arkouda.std(pda: pdarray, ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Return the standard deviation of values in the array. The standard deviation is implemented as the square root of the variance.

Parameters:
  • pda (pdarray) – values for which to calculate the standard deviation

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating std

Returns:

The scalar standard deviation of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance or ddof is not an integer

  • ValueError – Raised if ddof is an integer < 0

  • RuntimeError – Raised if there’s a server-side error thrown

See also

mean, var

Notes

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean((x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

class arkouda.str_

A unicode string.

This type strips trailing null codepoints.

>>> s = np.str_("abc\x00")
>>> s
'abc'

Unlike the builtin str, this supports the python:bufferobjects, exposing its contents as UCS4:

>>> m = memoryview(np.str_("abc"))
>>> m.format
'3w'
>>> m.tobytes()
b'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'
Character code:

'U'

Alias:

numpy.unicode_

T(*args, **kwargs)

Scalar attribute identical to the corresponding array attribute.

Please see ndarray.T.

all(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.all.

any(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.any.

argmax(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argmax.

argmin(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argmin.

argsort(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argsort.

astype(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.astype.

base(*args, **kwargs)

Scalar attribute identical to the corresponding array attribute.

Please see ndarray.base.

byteswap(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.byteswap.

choose(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.choose.

clip(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.clip.

compress(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.compress.

conj(*args, **kwargs)
conjugate(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.conjugate.

copy(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.copy.

cumprod(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.cumprod.

cumsum(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.cumsum.

data(*args, **kwargs)

Pointer to start of data.

diagonal(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.diagonal.

dtype(*args, **kwargs)

Get array data-descriptor.

dump(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.dump.

dumps(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.dumps.

fill(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.fill.

flags(*args, **kwargs)

The integer value of flags.

flat(*args, **kwargs)

A 1-D view of the scalar.

flatten(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.flatten.

getfield(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.getfield.

imag(*args, **kwargs)

The imaginary part of the scalar.

item(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.item.

itemset(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.itemset.

itemsize(*args, **kwargs)

The length of one element in bytes.

max(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.max.

mean(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.mean.

min(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.min.

nbytes(*args, **kwargs)

The length of the scalar in bytes.

ndim(*args, **kwargs)

The number of array dimensions.

newbyteorder(*args, **kwargs)

newbyteorder(new_order=’S’, /)

Return a new dtype with a different byte order.

Changes are also made in all fields and sub-arrays of the data type.

The new_order code can be any from the following:

  • ‘S’ - swap dtype from current to opposite endian

  • {‘<’, ‘little’} - little endian

  • {‘>’, ‘big’} - big endian

  • {‘=’, ‘native’} - native order

  • {‘|’, ‘I’} - ignore (no change to byte order)

new_orderstr, optional

Byte order to force; a value from the byte order specifications above. The default value (‘S’) results in swapping the current byte order.

new_dtypedtype

New dtype object with the given change to the byte order.

nonzero(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.nonzero.

prod(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.prod.

ptp(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.ptp.

put(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.put.

ravel(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.ravel.

real(*args, **kwargs)

The real part of the scalar.

repeat(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.repeat.

reshape(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.reshape.

resize(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.resize.

round(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.round.

searchsorted(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.searchsorted.

setfield(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.setfield.

setflags(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.setflags.

shape(*args, **kwargs)

Tuple of array dimensions.

size(*args, **kwargs)

The number of elements in the gentype.

sort(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.sort.

squeeze(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.squeeze.

std(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.std.

strides(*args, **kwargs)

Tuple of bytes steps in each dimension.

sum(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.sum.

swapaxes(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.swapaxes.

take(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.take.

tobytes(*args, **kwargs)
tofile(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tofile.

tolist(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tolist.

tostring(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tostring.

trace(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.trace.

transpose(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.transpose.

var(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.var.

view(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.view.

class arkouda.str_

A unicode string.

This type strips trailing null codepoints.

>>> s = np.str_("abc\x00")
>>> s
'abc'

Unlike the builtin str, this supports the python:bufferobjects, exposing its contents as UCS4:

>>> m = memoryview(np.str_("abc"))
>>> m.format
'3w'
>>> m.tobytes()
b'a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00'
Character code:

'U'

Alias:

numpy.unicode_

T(*args, **kwargs)

Scalar attribute identical to the corresponding array attribute.

Please see ndarray.T.

all(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.all.

any(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.any.

argmax(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argmax.

argmin(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argmin.

argsort(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.argsort.

astype(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.astype.

base(*args, **kwargs)

Scalar attribute identical to the corresponding array attribute.

Please see ndarray.base.

byteswap(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.byteswap.

choose(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.choose.

clip(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.clip.

compress(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.compress.

conj(*args, **kwargs)
conjugate(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.conjugate.

copy(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.copy.

cumprod(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.cumprod.

cumsum(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.cumsum.

data(*args, **kwargs)

Pointer to start of data.

diagonal(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.diagonal.

dtype(*args, **kwargs)

Get array data-descriptor.

dump(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.dump.

dumps(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.dumps.

fill(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.fill.

flags(*args, **kwargs)

The integer value of flags.

flat(*args, **kwargs)

A 1-D view of the scalar.

flatten(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.flatten.

getfield(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.getfield.

imag(*args, **kwargs)

The imaginary part of the scalar.

item(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.item.

itemset(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.itemset.

itemsize(*args, **kwargs)

The length of one element in bytes.

max(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.max.

mean(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.mean.

min(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.min.

nbytes(*args, **kwargs)

The length of the scalar in bytes.

ndim(*args, **kwargs)

The number of array dimensions.

newbyteorder(*args, **kwargs)

newbyteorder(new_order=’S’, /)

Return a new dtype with a different byte order.

Changes are also made in all fields and sub-arrays of the data type.

The new_order code can be any from the following:

  • ‘S’ - swap dtype from current to opposite endian

  • {‘<’, ‘little’} - little endian

  • {‘>’, ‘big’} - big endian

  • {‘=’, ‘native’} - native order

  • {‘|’, ‘I’} - ignore (no change to byte order)

new_orderstr, optional

Byte order to force; a value from the byte order specifications above. The default value (‘S’) results in swapping the current byte order.

new_dtypedtype

New dtype object with the given change to the byte order.

nonzero(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.nonzero.

prod(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.prod.

ptp(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.ptp.

put(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.put.

ravel(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.ravel.

real(*args, **kwargs)

The real part of the scalar.

repeat(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.repeat.

reshape(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.reshape.

resize(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.resize.

round(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.round.

searchsorted(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.searchsorted.

setfield(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.setfield.

setflags(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.setflags.

shape(*args, **kwargs)

Tuple of array dimensions.

size(*args, **kwargs)

The number of elements in the gentype.

sort(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.sort.

squeeze(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.squeeze.

std(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.std.

strides(*args, **kwargs)

Tuple of bytes steps in each dimension.

sum(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.sum.

swapaxes(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.swapaxes.

take(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.take.

tobytes(*args, **kwargs)
tofile(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tofile.

tolist(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tolist.

tostring(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.tostring.

trace(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.trace.

transpose(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.transpose.

var(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.var.

view(*args, **kwargs)

Scalar method identical to the corresponding array attribute.

Please see ndarray.view.

class arkouda.str_scalars(origin, params, *, inst=True, name=None)

Bases: _GenericAlias

The central part of internal API.

This represents a generic version of type ‘origin’ with type arguments ‘params’. There are two kind of these aliases: user defined and special. The special ones are wrappers around builtin collections and ABCs in collections.abc. These must have ‘name’ always set. If ‘inst’ is False, then the alias can’t be instantiated, this is used by e.g. typing.List and typing.Dict.

arkouda.string_operators(cls)[source]
arkouda.tan(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise tangent of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the tangent will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing tangent for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

arkouda.tanh(pda: arkouda.pdarrayclass.pdarray, where: bool | arkouda.pdarrayclass.pdarray = True) arkouda.pdarrayclass.pdarray[source]

Return the element-wise hyperbolic tangent of the array.

Parameters:
  • pda (pdarray)

  • where (Boolean or pdarray) – This condition is broadcast over the input. At locations where the condition is True, the hyperbolic tangent will be applied to the corresponding value. Elsewhere, it will retain its original value. Default set to True.

Returns:

A pdarray containing hyperbolic tangent for each element of the original pdarray

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

class arkouda.timedelta64(value)

Bases: numpy.signedinteger

A timedelta stored as a 64-bit integer.

See arrays.datetime for more information.

Character code:

'm'

arkouda.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None, **kwargs)[source]

Return a fixed frequency TimedeltaIndex, with day as the default frequency. Alias for ak.Timedelta(pd.timedelta_range(args)). Subject to size limit imposed by client.maxTransferBytes.

Parameters:
  • start (str or timedelta-like, default None) – Left bound for generating timedeltas.

  • end (str or timedelta-like, default None) – Right bound for generating timedeltas.

  • periods (int, default None) – Number of periods to generate.

  • freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’.

  • name (str, default None) – Name of the resulting TimedeltaIndex.

  • closed (str, default None) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None).

Returns:

rng

Return type:

TimedeltaIndex

Notes

Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting TimedeltaIndex will have periods linearly spaced elements between start and end (closed on both sides).

To learn more about the frequency strings, please see this link.

arkouda.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None, **kwargs)[source]

Return a fixed frequency TimedeltaIndex, with day as the default frequency. Alias for ak.Timedelta(pd.timedelta_range(args)). Subject to size limit imposed by client.maxTransferBytes.

Parameters:
  • start (str or timedelta-like, default None) – Left bound for generating timedeltas.

  • end (str or timedelta-like, default None) – Right bound for generating timedeltas.

  • periods (int, default None) – Number of periods to generate.

  • freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’.

  • name (str, default None) – Name of the resulting TimedeltaIndex.

  • closed (str, default None) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None).

Returns:

rng

Return type:

TimedeltaIndex

Notes

Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting TimedeltaIndex will have periods linearly spaced elements between start and end (closed on both sides).

To learn more about the frequency strings, please see this link.

arkouda.to_csv(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings], prefix_path: str, names: List[str] | None = None, col_delim: str = ',', overwrite: bool = False)[source]

Write Arkouda object(s) to CSV file(s). All CSV Files written by Arkouda include a header denoting data types of the columns.

Parameters:
  • columns (Mapping[str, pdarray] or List[pdarray]) – The objects to be written to CSV file. If a mapping is used and names is None the keys of the mapping will be used as the dataset names.

  • prefix_path (str) – The filename prefix to be used for saving files. Files will have _LOCALE#### appended when they are written to disk.

  • names (List[str] (Optional)) – names of dataset to be written. Order should correspond to the order of data provided in columns.

  • col_delim (str) – Defaults to “,”. Value to be used to separate columns within the file. Please be sure that the value used DOES NOT appear in your dataset.

  • overwrite (bool) – Defaults to False. If True, any existing files matching your provided prefix_path will be overwritten. If False, an error will be returned if existing files are found.

Return type:

None

Raises:
  • ValueError – Raised if any datasets are present in all csv files or if one or more of the specified files do not exist

  • RuntimeError – Raised if one or more of the specified files cannot be opened. If allow_errors is true this may be raised if no values are returned from the server.

  • TypeError – Raised if we receive an unknown arkouda_type returned from the server

See also

read_csv

Notes

  • CSV format is not currently supported by load/load_all operations

  • The column delimiter is expected to be the same for column names and data

  • Be sure that column delimiters are not found within your data.

  • All CSV files must delimit rows using newline (\n) at this time.

  • Unlike other file formats, CSV files store Strings as their UTF-8 format instead of storing bytes as uint(8).

arkouda.to_hdf(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: str = 'truncate', file_type: str = 'distribute') None[source]

Save multiple named pdarrays to HDF5 files.

Parameters:
  • columns (dict or list of pdarrays) – Collection of arrays to save

  • prefix_path (str) – Directory and filename prefix for output files

  • names (list of str) – Dataset names for the pdarrays

  • mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files.

  • file_type (str ("single" | "distribute")) – Default: distribute Single writes the dataset to a single file Distribute writes the dataset to a file per locale

Return type:

None

Raises:
  • ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode is not ‘truncate’ or ‘append’

  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

Notes

Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the HDF5 dataset names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at path_prefix will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.

Examples

>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.to_hdf({'a': a, 'b': b}, 'path/name_prefix')
>>> # Save using names instead of mapping
>>> ak.to_hdf([a, b], 'path/name_prefix', names=['a', 'b'])
arkouda.to_parquet(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray], prefix_path: str, names: List[str] | None = None, mode: str = 'truncate', compression: str | None = None, convert_categoricals: bool = False) None[source]

Save multiple named pdarrays to Parquet files.

Parameters:
  • columns (dict or list of pdarrays) – Collection of arrays to save

  • prefix_path (str) – Directory and filename prefix for output files

  • names (list of str) – Dataset names for the pdarrays

  • mode ({'truncate' | 'append'}) – By default, truncate (overwrite) the output files if they exist. If ‘append’, attempt to create new dataset in existing files. ‘append’ is deprecated, please use the multi-column write

  • compression (str (Optional)) –

    Default None

    Provide the compression type to use when writing the file. Supported values: snappy, gzip, brotli, zstd, lz4

    convert_categoricals: bool

    Defaults to False Parquet requires all columns to be the same size and Categoricals don’t satisfy that requirement. if set, write the equivalent Strings in place of any Categorical columns.

Return type:

None

Raises:
  • ValueError – Raised if (1) the lengths of columns and values differ or (2) the mode is not ‘truncate’ or ‘append’

  • RuntimeError – Raised if a server-side error is thrown saving the pdarray

See also

to_hdf, load, load_all, read

Notes

Creates one file per locale containing that locale’s chunk of each pdarray. If columns is a dictionary, the keys are used as the Parquet column names. Otherwise, if no names are supplied, 0-up integers are used. By default, any existing files at path_prefix will be overwritten, unless the user specifies the ‘append’ mode, in which case arkouda will attempt to add <columns> as new datasets to existing files. If the wrong number of files is present or dataset names already exist, a RuntimeError is raised.

Examples

>>> a = ak.arange(25)
>>> b = ak.arange(25)
>>> # Save with mapping defining dataset names
>>> ak.to_parquet({'a': a, 'b': b}, 'path/name_prefix')
>>> # Save using names instead of mapping
>>> ak.to_parquet([a, b], 'path/name_prefix', names=['a', 'b'])
arkouda.to_zarr(store_path: str, arr: arkouda.pdarrayclass.pdarray, chunk_shape)[source]

Writes a pdarray to disk as a Zarr store. Supports multi-dimensional pdarrays of numeric types. To use this function, ensure you have installed the blosc dependency (make install-blosc) and have included ZarrMsg.chpl in the ServerModules.cfg file.

Parameters:
  • store_path (str) – The path at which Zarr store should be written

  • arr (pdarray) – The pdarray to be written to disk

  • chunk_shape (tuple) – The shape of the chunks to be used in the Zarr store

Raises:

ValueError – Raised if the number of dimensions in the chunk shape does not match the number of dimensions in the array or if the array is not a 32 or 64 bit numeric type

arkouda.transpose(pda: arkouda.pdarrayclass.pdarray)[source]

Compute the transpose of a matrix.

Parameters:

pda (pdarray)

Returns:

the transpose of the input matrix

Return type:

pdarray

Examples

>>> a = ak.array([[1,2,3,4,5],[1,2,3,4,5]])
>>> ak.transpose(a)
array([array([1 1]) array([2 2]) array([3 3]) array([4 4]) array([5 5])])

Notes

Server returns an error if rank of pda < 2

arkouda.tril(pda: arkouda.pdarrayclass.pdarray, diag: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 = 0)[source]

Return a copy of the pda with the upper triangle zeroed out

Parameters:
  • pda (pdarray)

  • diag (int_scalars) – if diag = 0, zeros start just below the main diagonal if diag = 1, zeros start at the main diagonal if diag = 2, zeros start just above the main diagonal etc.

Returns:

a copy of pda with zeros in the upper triangle

Return type:

pdarray

Examples

>>> a = ak.array([[1,2,3,4,5],[2,3,4,5,6],[3,4,5,6,7],[4,5,6,7,8],[5,6,7,8,9]])
>>> ak.tril(a,diag=4)
array([array([1 2 3 4 5]) array([2 3 4 5 6]) array([3 4 5 6 7])
array([4 5 6 7 8]) array([5 6 7 8 9])])
>>> ak.tril(a,diag=3)
array([array([1 2 3 4 0]) array([2 3 4 5 6]) array([3 4 5 6 7])
array([4 5 6 7 8]) array([5 6 7 8 9])])
>>> ak.tril(a,diag=2)
array([array([1 2 3 0 0]) array([2 3 4 5 0]) array([3 4 5 6 7])
array([4 5 6 7 8]) array([5 6 7 8 9])])
>>> ak.tril(a,diag=1)
array([array([1 2 0 0 0]) array([2 3 4 0 0]) array([3 4 5 6 0])
array([4 5 6 7 8]) array([5 6 7 8 9])])
>>> ak.tril(a,diag=0)
array([array([1 0 0 0 0]) array([2 3 0 0 0]) array([3 4 5 0 0])
array([4 5 6 7 0]) array([5 6 7 8 9])])

Notes

Server returns an error if rank of pda < 2

arkouda.triu(pda: arkouda.pdarrayclass.pdarray, diag: int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 = 0)[source]

Return a copy of the pda with the lower triangle zeroed out

Parameters:
  • pda (pdarray)

  • diag (int_scalars) – if diag = 0, zeros start just above the main diagonal if diag = 1, zeros start at the main diagonal if diag = 2, zeros start just below the main diagonal etc.

Returns:

a copy of pda with zeros in the lower triangle

Return type:

pdarray

Examples

>>> a = ak.array([[1,2,3,4,5],[2,3,4,5,6],[3,4,5,6,7],[4,5,6,7,8],[5,6,7,8,9]])
>>> ak.triu(a,diag=0)
array([array([1 2 3 4 5]) array([0 3 4 5 6]) array([0 0 5 6 7])
array([0 0 0 7 8]) array([0 0 0 0 9])])
>>> ak.triu(a,diag=1)
array([array([0 2 3 4 5]) array([0 0 4 5 6]) array([0 0 0 6 7])
array([0 0 0 0 8]) array([0 0 0 0 0])])
>>> ak.triu(a,diag=2)
array([array([0 0 3 4 5]) array([0 0 0 5 6]) array([0 0 0 0 7])
array([0 0 0 0 0]) array([0 0 0 0 0])])
>>> ak.triu(a,diag=3)
array([array([0 0 0 4 5]) array([0 0 0 0 6]) array([0 0 0 0 0])
array([0 0 0 0 0]) array([0 0 0 0 0])])
>>> ak.triu(a,diag=4)
array([array([0 0 0 0 5]) array([0 0 0 0 0]) array([0 0 0 0 0])
array([0 0 0 0 0]) array([0 0 0 0 0])])

Notes

Server returns an error if rank of pda < 2

arkouda.trunc(pda: arkouda.pdarrayclass.pdarray) arkouda.pdarrayclass.pdarray[source]

Return the element-wise truncation of the array.

Parameters:

pda (pdarray)

Returns:

A pdarray containing input array elements truncated to the nearest integer

Return type:

pdarray

Raises:

TypeError – Raised if the parameter is not a pdarray

Examples

>>> ak.trunc(ak.array([1.1, 2.5, 3.14159]))
array([1, 2, 3])
arkouda.typename(char)

Return a description for the given data type code.

Parameters:

char (str) – Data type code.

Returns:

out – Description of the input data type code.

Return type:

str

See also

dtype, typecodes

Examples

>>> typechars = ['S1', '?', 'B', 'D', 'G', 'F', 'I', 'H', 'L', 'O', 'Q',
...              'S', 'U', 'V', 'b', 'd', 'g', 'f', 'i', 'h', 'l', 'q']
>>> for typechar in typechars:
...     print(typechar, ' : ', np.typename(typechar))
...
S1  :  character
?  :  bool
B  :  unsigned char
D  :  complex double precision
G  :  complex long double precision
F  :  complex single precision
I  :  unsigned integer
H  :  unsigned short
L  :  unsigned long integer
O  :  object
Q  :  unsigned long long integer
S  :  string
U  :  unicode
V  :  void
b  :  signed char
d  :  double precision
g  :  long precision
f  :  single precision
i  :  integer
h  :  short
l  :  long integer
q  :  long long integer
class arkouda.ubyte(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned char.

Character code:

'B'

Canonical name:

numpy.ubyte

Alias on this platform (Linux x86_64):

numpy.uint8: 8-bit unsigned integer (0 to 255).

bit_count(*args, **kwargs)

uint8.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint8(127).bit_count()
7
class arkouda.uint(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned long.

Character code:

'L'

Canonical name:

numpy.uint

Alias on this platform (Linux x86_64):

numpy.uint64: 64-bit unsigned integer (0 to 18_446_744_073_709_551_615).

Alias on this platform (Linux x86_64):

numpy.uintp: Unsigned integer large enough to fit pointer, compatible with C uintptr_t.

bit_count(*args, **kwargs)

uint64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint64(127).bit_count()
7
class arkouda.uint16(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned short.

Character code:

'H'

Canonical name:

numpy.ushort

Alias on this platform (Linux x86_64):

numpy.uint16: 16-bit unsigned integer (0 to 65_535).

bit_count(*args, **kwargs)

uint16.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint16(127).bit_count()
7
class arkouda.uint32(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned int.

Character code:

'I'

Canonical name:

numpy.uintc

Alias on this platform (Linux x86_64):

numpy.uint32: 32-bit unsigned integer (0 to 4_294_967_295).

bit_count(*args, **kwargs)

uint32.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint32(127).bit_count()
7
class arkouda.uint64(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned long.

Character code:

'L'

Canonical name:

numpy.uint

Alias on this platform (Linux x86_64):

numpy.uint64: 64-bit unsigned integer (0 to 18_446_744_073_709_551_615).

Alias on this platform (Linux x86_64):

numpy.uintp: Unsigned integer large enough to fit pointer, compatible with C uintptr_t.

bit_count(*args, **kwargs)

uint64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint64(127).bit_count()
7
class arkouda.uint8(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned char.

Character code:

'B'

Canonical name:

numpy.ubyte

Alias on this platform (Linux x86_64):

numpy.uint8: 8-bit unsigned integer (0 to 255).

bit_count(*args, **kwargs)

uint8.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint8(127).bit_count()
7
class arkouda.uintc(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned int.

Character code:

'I'

Canonical name:

numpy.uintc

Alias on this platform (Linux x86_64):

numpy.uint32: 32-bit unsigned integer (0 to 4_294_967_295).

bit_count(*args, **kwargs)

uint32.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint32(127).bit_count()
7
class arkouda.uintp(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned long.

Character code:

'L'

Canonical name:

numpy.uint

Alias on this platform (Linux x86_64):

numpy.uint64: 64-bit unsigned integer (0 to 18_446_744_073_709_551_615).

Alias on this platform (Linux x86_64):

numpy.uintp: Unsigned integer large enough to fit pointer, compatible with C uintptr_t.

bit_count(*args, **kwargs)

uint64.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint64(127).bit_count()
7
class arkouda.ulonglong(value)

Bases: numpy.unsignedinteger

Signed integer type, compatible with C unsigned long long.

Character code:

'Q'

bit_count(*args, **kwargs)
arkouda.union1d(pda1: arkouda.groupbyclass.groupable, pda2: arkouda.groupbyclass.groupable) arkouda.groupbyclass.groupable[source]

Find the union of two arrays/List of Arrays.

Return the unique, sorted array of values that are in either of the two input arrays.

Parameters:
  • pda1 (pdarray/Sequence[pdarray, Strings, Categorical]) – Input array/Sequence of groupable objects

  • pda2 (pdarray/List) – Input array/sequence of groupable objects

Returns:

Unique, sorted union of the input arrays.

Return type:

pdarray/groupable

Raises:
  • TypeError – Raised if either pda1 or pda2 is not a pdarray

  • RuntimeError – Raised if the dtype of either array is not supported

Notes

ak.union1d is not supported for bool or float64 pdarrays

Examples

>>>
# 1D Example
>>> ak.union1d(ak.array([-1, 0, 1]), ak.array([-2, 0, 2]))
array([-2, -1, 0, 1, 2])
#Multi-Array Example
>>> a = ak.arange(1, 6)
>>> b = ak.array([1, 5, 3, 4, 2])
>>> c = ak.array([1, 4, 3, 2, 5])
>>> d = ak.array([1, 2, 3, 5, 4])
>>> multia = [a, a, a]
>>> multib = [b, c, d]
>>> ak.union1d(multia, multib)
[array[1, 2, 2, 3, 4, 4, 5, 5], array[1, 2, 5, 3, 2, 4, 4, 5], array[1, 2, 4, 3, 5, 4, 2, 5]]
arkouda.unique(pda: groupable, return_groups: bool = False, assume_sorted: bool = False, return_indices: bool = False) groupable | Tuple[groupable, pdarray, pdarray, int][source]

Find the unique elements of an array.

Returns the unique elements of an array, sorted if the values are integers. There is an optional output in addition to the unique elements: the number of times each unique value comes up in the input array.

Parameters:
  • pda ((list of) pdarray, Strings, or Categorical) – Input array.

  • return_groups (bool, optional) – If True, also return grouping information for the array.

  • assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step

  • return_indices (bool, optional) – Only applicable if return_groups is True. If True, return unique key indices along with other groups

Returns:

  • unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.

  • permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)

  • segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)

Raises:
  • TypeError – Raised if pda is not a pdarray or Strings object

  • RuntimeError – Raised if the pdarray or Strings dtype is unsupported

Notes

For integer arrays, this function checks to see whether pda is sorted and, if so, whether it is already unique. This step can save considerable computation. Otherwise, this function will sort pda.

Examples

>>> A = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(A)
array([1, 2, 3])
arkouda.unique(pda: groupable, return_groups: bool = False, assume_sorted: bool = False, return_indices: bool = False) groupable | Tuple[groupable, pdarray, pdarray, int][source]

Find the unique elements of an array.

Returns the unique elements of an array, sorted if the values are integers. There is an optional output in addition to the unique elements: the number of times each unique value comes up in the input array.

Parameters:
  • pda ((list of) pdarray, Strings, or Categorical) – Input array.

  • return_groups (bool, optional) – If True, also return grouping information for the array.

  • assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step

  • return_indices (bool, optional) – Only applicable if return_groups is True. If True, return unique key indices along with other groups

Returns:

  • unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.

  • permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)

  • segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)

Raises:
  • TypeError – Raised if pda is not a pdarray or Strings object

  • RuntimeError – Raised if the pdarray or Strings dtype is unsupported

Notes

For integer arrays, this function checks to see whether pda is sorted and, if so, whether it is already unique. This step can save considerable computation. Otherwise, this function will sort pda.

Examples

>>> A = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(A)
array([1, 2, 3])
arkouda.unique(pda: groupable, return_groups: bool = False, assume_sorted: bool = False, return_indices: bool = False) groupable | Tuple[groupable, pdarray, pdarray, int][source]

Find the unique elements of an array.

Returns the unique elements of an array, sorted if the values are integers. There is an optional output in addition to the unique elements: the number of times each unique value comes up in the input array.

Parameters:
  • pda ((list of) pdarray, Strings, or Categorical) – Input array.

  • return_groups (bool, optional) – If True, also return grouping information for the array.

  • assume_sorted (bool, optional) – If True, assume pda is sorted and skip sorting step

  • return_indices (bool, optional) – Only applicable if return_groups is True. If True, return unique key indices along with other groups

Returns:

  • unique ((list of) pdarray, Strings, or Categorical) – The unique values. If input dtype is int64, return values will be sorted.

  • permutation (pdarray, optional) – Permutation that groups equivalent values together (only when return_groups=True)

  • segments (pdarray, optional) – The offset of each group in the permuted array (only when return_groups=True)

Raises:
  • TypeError – Raised if pda is not a pdarray or Strings object

  • RuntimeError – Raised if the pdarray or Strings dtype is unsupported

Notes

For integer arrays, this function checks to see whether pda is sorted and, if so, whether it is already unique. This step can save considerable computation. Otherwise, this function will sort pda.

Examples

>>> A = ak.array([3, 2, 1, 1, 2, 3])
>>> ak.unique(A)
array([1, 2, 3])
arkouda.unregister(name: str) str[source]
arkouda.unregister_all(names: list)[source]

Unregister all names provided

Parameters:

names (list) – List of names used to register objects to be unregistered

Return type:

None

arkouda.unregister_pdarray_by_name(user_defined_name: str) None[source]

Unregister a named pdarray in the arkouda server which was previously registered using register() and/or attahced to using attach_pdarray()

Parameters:

user_defined_name (str) – user defined name which array was registered under

Return type:

None

Raises:

RuntimeError – Raised if the server could not find the internal name/symbol to remove

Notes

Registered names/pdarrays in the server are immune to deletion until they are unregistered.

Examples

>>> a = zeros(100)
>>> a.register("my_zeros")
>>> # potentially disconnect from server and reconnect to server
>>> b = ak.attach_pdarray("my_zeros")
>>> # ...other work...
>>> ak.unregister_pdarray_by_name(b)
class arkouda.unsignedinteger(value)

Bases: numpy.integer

Abstract base class of all unsigned integer scalar types.

arkouda.unsqueeze(p)[source]
arkouda.update_hdf(columns: Mapping[str, arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray] | List[arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | arkouda.segarray.SegArray], prefix_path: str, names: List[str] | None = None, repack: bool = True)[source]

Overwrite the datasets with name appearing in names or keys in columns if columns is a dictionary

Parameters:
  • columns (dict or list of pdarrays) – Collection of arrays to save

  • prefix_path (str) – Directory and filename prefix for output files

  • names (list of str) – Dataset names for the pdarrays

  • repack (bool) – Default: True HDF5 does not release memory on delete. When True, the inaccessible data (that was overwritten) is removed. When False, the data remains, but is inaccessible. Setting to false will yield better performance, but will cause file sizes to expand.

Raises:

RuntimeError – Raised if a server-side error is thrown saving the datasets

Notes

  • If file does not contain File_Format attribute to indicate how it was saved, the file name is checked for _LOCALE#### to determine if it is distributed.

  • If the datasets provided do not exist, they will be added

  • Because HDF5 deletes do not release memory, this will create a copy of the file with the new data

  • This workflow is slightly different from to_hdf to prevent reading and creating a copy of the file for each dataset

class arkouda.ushort(value)

Bases: numpy.unsignedinteger

Unsigned integer type, compatible with C unsigned short.

Character code:

'H'

Canonical name:

numpy.ushort

Alias on this platform (Linux x86_64):

numpy.uint16: 16-bit unsigned integer (0 to 65_535).

bit_count(*args, **kwargs)

uint16.bit_count() -> int

Computes the number of 1-bits in the absolute value of the input. Analogous to the builtin int.bit_count or popcount in C++.

>>> np.uint16(127).bit_count()
7
arkouda.value_counts(pda: arkouda.pdarrayclass.pdarray) tuple[source]

Count the occurrences of the unique values of an array.

Parameters:

pda (pdarray, int64) – The array of values to count

Returns:

  • unique_values (pdarray, int64 or Strings) – The unique values, sorted in ascending order

  • counts (pdarray, int64) – The number of times the corresponding unique value occurs

Raises:

TypeError – Raised if the parameter is not a pdarray

See also

unique, histogram

Notes

This function differs from histogram() in that it only returns counts for values that are present, leaving out empty “bins”. This function delegates all logic to the unique() method where the return_counts parameter is set to True.

Examples

>>> A = ak.array([2, 0, 2, 4, 0, 0])
>>> ak.value_counts(A)
(array([0, 2, 4]), array([3, 2, 1]))
arkouda.var(pda: pdarray, ddof: arkouda.numpy.dtypes.int_scalars = 0) numpy.float64[source]

Return the variance of values in the array.

Parameters:
  • pda (pdarray) – Values for which to calculate the variance

  • ddof (int_scalars) – “Delta Degrees of Freedom” used in calculating var

Returns:

The scalar variance of the array

Return type:

np.float64

Raises:
  • TypeError – Raised if pda is not a pdarray instance

  • ValueError – Raised if the ddof >= pdarray size

  • RuntimeError – Raised if there’s a server-side error thrown

See also

mean, std

Notes

The variance is the average of the squared deviations from the mean, i.e., var = mean((x - x.mean())**2).

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.

arkouda.vecdot(x1: arkouda.pdarrayclass.pdarray, x2: arkouda.pdarrayclass.pdarray)[source]

Compute the generalized dot product of two vectors along the given axis. Assumes that both tensors have already been broadcast to the same shape.

Parameters:
Returns:

x1 vecdot x2

Return type:

pdarray

Examples

>>> a = ak.array([[1,2,3,4,5],[1,2,3,4,5]])
>>> b = ak.array(([2,2,2,2,2],[2,2,2,2,2]])
>>> ak.vecdot(a,b)
array([5 10 15 20 25])
>>> ak.vecdot(b,a)
array([5 10 15 20 25])
Raises:

ValueTypeError – Raised if x1 and x2 are not of matching shape or if rank of x1 < 2

class arkouda.void(value)

Bases: numpy.flexible

np.void(length_or_data, /, dtype=None)

Create a new structured or unstructured void scalar.

length_or_dataint, array-like, bytes-like, object

One of multiple meanings (see notes). The length or bytes data of an unstructured void. Or alternatively, the data to be stored in the new scalar when dtype is provided. This can be an array-like, in which case an array may be returned.

dtypedtype, optional

If provided the dtype of the new scalar. This dtype must be “void” dtype (i.e. a structured or unstructured void, see also defining-structured-types).

..versionadded:: 1.24

For historical reasons and because void scalars can represent both arbitrary byte data and structured dtypes, the void constructor has three calling conventions:

  1. np.void(5) creates a dtype="V5" scalar filled with five \0 bytes. The 5 can be a Python or NumPy integer.

  2. np.void(b"bytes-like") creates a void scalar from the byte string. The dtype itemsize will match the byte string length, here "V10".

  3. When a dtype= is passed the call is roughly the same as an array creation. However, a void scalar rather than array is returned.

Please see the examples which show all three different conventions.

>>> np.void(5)
void(b'\x00\x00\x00\x00\x00')
>>> np.void(b'abcd')
void(b'\x61\x62\x63\x64')
>>> np.void((5, 3.2, "eggs"), dtype="i,d,S5")
(5, 3.2, b'eggs')  # looks like a tuple, but is `np.void`
>>> np.void(3, dtype=[('x', np.int8), ('y', np.int8)])
(3, 3)  # looks like a tuple, but is `np.void`
Character code:

'V'

arkouda.vstack(tup: Tuple[arkouda.pdarrayclass.pdarray] | List[arkouda.pdarrayclass.pdarray], *, dtype: type | str | None = None, casting: Literal['no', 'equiv', 'safe', 'same_kind', 'unsafe'] = 'same_kind') arkouda.pdarrayclass.pdarray[source]

Stack a sequence of arrays vertically (row-wise).

This is equivalent to concatenation along the first axis after 1-D arrays of shape (N,) have been reshaped to (1,N).

Parameters:
  • tup (Tuple[pdarray]) – The arrays to be stacked

  • dtype (Optional[Union[type, str]], optional) – The data-type of the output array. If not provided, the output array will be determined using np.common_type on the input arrays Defaults to None

  • casting ({"no", "equiv", "safe", "same_kind", "unsafe"], optional) – Controls what kind of data casting may occur - currently unused

Returns:

  • pdarray – The stacked array

arkouda.where(condition: arkouda.pdarrayclass.pdarray, A: str | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical, B: str | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical[source]

Returns an array with elements chosen from A and B based upon a conditioning array. As is the case with numpy.where, the return array consists of values from the first array (A) where the conditioning array elements are True and from the second array (B) where the conditioning array elements are False.

Parameters:
Returns:

Values chosen from A where the condition is True and B where the condition is False

Return type:

pdarray

Raises:
  • TypeError – Raised if the condition object is not a pdarray, if A or B is not an int, np.int64, float, np.float64, pdarray, str, Strings, Categorical if pdarray dtypes are not supported or do not match, or multiple condition clauses (see Notes section) are applied

  • ValueError – Raised if the shapes of the condition, A, and B pdarrays are unequal

Examples

>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 1, 1, 1, 1, 1])
>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 == 5
>>> ak.where(cond,a1,a2)
array([1, 1, 1, 1, 5, 1, 1, 1, 1])
>>> a1 = ak.arange(1,10)
>>> a2 = 10
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 10, 10, 10, 10, 10])
>>> s1 = ak.array([f'str {i}' for i in range(10)])
>>> s2 = 'str 21'
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,s1,s2)
array(['str 0', 'str 21', 'str 2', 'str 21', 'str 4', 'str 21', 'str 6', 'str 21', 'str 8','str 21'])
>>> c1 = ak.Categorical(ak.array([f'str {i}' for i in range(10)]))
>>> c2 = ak.Categorical(ak.array([f'str {i}' for i in range(9, -1, -1)]))
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,c1,c2)
array(['str 0', 'str 8', 'str 2', 'str 6', 'str 4', 'str 4', 'str 6', 'str 2', 'str 8', 'str 0'])

Notes

A and B must have the same dtype and only one conditional clause is supported e.g., n < 5, n > 1, which is supported in numpy is not currently supported in Arkouda

arkouda.where(condition: arkouda.pdarrayclass.pdarray, A: str | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical, B: str | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical[source]

Returns an array with elements chosen from A and B based upon a conditioning array. As is the case with numpy.where, the return array consists of values from the first array (A) where the conditioning array elements are True and from the second array (B) where the conditioning array elements are False.

Parameters:
Returns:

Values chosen from A where the condition is True and B where the condition is False

Return type:

pdarray

Raises:
  • TypeError – Raised if the condition object is not a pdarray, if A or B is not an int, np.int64, float, np.float64, pdarray, str, Strings, Categorical if pdarray dtypes are not supported or do not match, or multiple condition clauses (see Notes section) are applied

  • ValueError – Raised if the shapes of the condition, A, and B pdarrays are unequal

Examples

>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 1, 1, 1, 1, 1])
>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 == 5
>>> ak.where(cond,a1,a2)
array([1, 1, 1, 1, 5, 1, 1, 1, 1])
>>> a1 = ak.arange(1,10)
>>> a2 = 10
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 10, 10, 10, 10, 10])
>>> s1 = ak.array([f'str {i}' for i in range(10)])
>>> s2 = 'str 21'
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,s1,s2)
array(['str 0', 'str 21', 'str 2', 'str 21', 'str 4', 'str 21', 'str 6', 'str 21', 'str 8','str 21'])
>>> c1 = ak.Categorical(ak.array([f'str {i}' for i in range(10)]))
>>> c2 = ak.Categorical(ak.array([f'str {i}' for i in range(9, -1, -1)]))
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,c1,c2)
array(['str 0', 'str 8', 'str 2', 'str 6', 'str 4', 'str 4', 'str 6', 'str 2', 'str 8', 'str 0'])

Notes

A and B must have the same dtype and only one conditional clause is supported e.g., n < 5, n > 1, which is supported in numpy is not currently supported in Arkouda

arkouda.where(condition: arkouda.pdarrayclass.pdarray, A: str | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical, B: str | float | numpy.float64 | numpy.float32 | int | numpy.int8 | numpy.int16 | numpy.int32 | numpy.int64 | numpy.uint8 | numpy.uint16 | numpy.uint32 | numpy.uint64 | arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical) arkouda.pdarrayclass.pdarray | arkouda.strings.Strings | Categorical[source]

Returns an array with elements chosen from A and B based upon a conditioning array. As is the case with numpy.where, the return array consists of values from the first array (A) where the conditioning array elements are True and from the second array (B) where the conditioning array elements are False.

Parameters:
Returns:

Values chosen from A where the condition is True and B where the condition is False

Return type:

pdarray

Raises:
  • TypeError – Raised if the condition object is not a pdarray, if A or B is not an int, np.int64, float, np.float64, pdarray, str, Strings, Categorical if pdarray dtypes are not supported or do not match, or multiple condition clauses (see Notes section) are applied

  • ValueError – Raised if the shapes of the condition, A, and B pdarrays are unequal

Examples

>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 1, 1, 1, 1, 1])
>>> a1 = ak.arange(1,10)
>>> a2 = ak.ones(9, dtype=np.int64)
>>> cond = a1 == 5
>>> ak.where(cond,a1,a2)
array([1, 1, 1, 1, 5, 1, 1, 1, 1])
>>> a1 = ak.arange(1,10)
>>> a2 = 10
>>> cond = a1 < 5
>>> ak.where(cond,a1,a2)
array([1, 2, 3, 4, 10, 10, 10, 10, 10])
>>> s1 = ak.array([f'str {i}' for i in range(10)])
>>> s2 = 'str 21'
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,s1,s2)
array(['str 0', 'str 21', 'str 2', 'str 21', 'str 4', 'str 21', 'str 6', 'str 21', 'str 8','str 21'])
>>> c1 = ak.Categorical(ak.array([f'str {i}' for i in range(10)]))
>>> c2 = ak.Categorical(ak.array([f'str {i}' for i in range(9, -1, -1)]))
>>> cond = (ak.arange(10) % 2 == 0)
>>> ak.where(cond,c1,c2)
array(['str 0', 'str 8', 'str 2', 'str 6', 'str 4', 'str 4', 'str 6', 'str 2', 'str 8', 'str 0'])

Notes

A and B must have the same dtype and only one conditional clause is supported e.g., n < 5, n > 1, which is supported in numpy is not currently supported in Arkouda

arkouda.write_log(log_msg: str, tag: str = 'ClientGeneratedLog', log_lvl: LogLevel = LogLevel.INFO)[source]

Allows the user to write custom logs.

Parameters:
  • log_msg (str) – The message to be added to the server log

  • tag (str) – The tag to use in the log. This takes the place of the server function name. Allows for easy identification of custom logs. Defaults to “ClientGeneratedLog”

  • log_lvl (LogLevel) – The type of log to be written Defaults to LogLevel.INFO

See also

LogLevel

arkouda.xlogy(x: arkouda.pdarrayclass.pdarray | numpy.float64, y: arkouda.pdarrayclass.pdarray)[source]

Computes x * log(y).

Parameters:
  • x (pdarray or np.float64) – x must have a datatype that is castable to float64

  • y (pdarray)

Return type:

arkouda.pdarrayclass.pdarray

Examples

>>> import arkouda as ak
>>> ak.connect()
>>> from arkouda.scipy.special import xlogy
>>> xlogy( ak.array([1, 2, 3, 4]),  ak.array([5,6,7,8]))
array([1.6094379124341003 3.5835189384561099 5.8377304471659395 8.317766166719343])
>>> xlogy( 5.0, ak.array([1, 2, 3, 4]))
array([0.00000000000000000 3.4657359027997265 5.4930614433405491 6.9314718055994531])
arkouda.zero_up(vals)[source]

Map an array of sparse values to 0-up indices.

Parameters:

vals (pdarray) – Array to map to dense index

Returns:

aligned – Array with values replaced by 0-up indices

Return type:

pdarray